Current approaches to AI coding agents appear to blur the line between the Large Language Model (LLM) and the agent itself, asking the LLM to make decisions best left to deterministic processes. This leads to systems prone to stochastic failures such as gaming unit tests or hallucinating syntax. Drawing on established software engineering practices that provide deterministic frameworks for managing unpredictable processes, this paper proposes setting the control boundary so that the LLM is treated as a component of the environment, preserving its creative stochasticity, rather than as the decision-making agent.

A Dual-State Architecture is formalized, separating workflow state (deterministic control flow) from environment state (stochastic generation). Atomic Action Pairs couple generation with verification as indivisible transactions, where Guard Functions act as sensing actions that project probabilistic outputs onto observable workflow state.

The framework is validated on three code generation tasks across 13 LLMs (1.3B–15B parameters). For qualified instruction-following models, task success rates improved by up to 66 percentage points at 1.2–2.1 $\times$ baseline computational cost. The results suggest that architectural constraints can substitute for parameter scale in achieving reliable code generation.

Keywords: Neuro-symbolic AI, LLM Agents, Runtime Verification, Code Generation, Iterative Refinement, Software Engineering

1 Introduction

1.1 Motivation: A Formal Building Block for Software Engineering

Modern software engineering has evolved sophisticated practices for managing complexity and uncertainty: CFEngine for infrastructure convergence, XP for iterative development, CI/CD for continuous validation, and Domain-Driven Design for semantic boundaries. These practices share a common thread—they provide deterministic frameworks for managing inherently unpredictable processes.

Large Language Models present a parallel challenge: substantial capability with stochastic behavior. While "Attention is All You Need" launched the transformer revolution, experience suggests that Attention is Not All You Need for Software Engineering—robust systems also require the scaffolding of verification, convergence, and bounded autonomy. Just as CFEngine treats infrastructure as eventually consistent rather than immediately correct, LLM outputs are treated in this framework as eventually valid through iterative refinement.

This work does not propose a revolutionary new algorithm, but rather a formalization of these emerging architectural patterns. The contribution is a theoretical grounding of these heuristics—providing the vocabulary, convergence guarantees, and formal reasoning framework required to transform ad-hoc "guardrails" into rigorous engineering disciplines.

Specifically, this work formalizes the separation of deterministic control flow from stochastic content generation. Through Atomic Action Pairs (inseparable generation-verification units) and a Dual-State Solution Space (workflow state versus environment state), the framework enables LLMs to operate within traditional software engineering bounds. Each verification failure provides feedback that refines subsequent generation attempts, achieving reliability through iteration rather than perfection.

Additionally, the architecture provides a potential approach to the Credit Assignment Problem inherent in LLM training. By enforcing immediate verification, the framework naturally generates immediate reward signals attributed to specific generation attempts, rather than the sparse rewards typical of end-to-end generation. These verified traces can support both online reinforcement learning and offline supervised fine-tuning (e.g., via LoRA). Over time, this could theoretically reduce retry rates from the observed 2.1× toward 1.0× as models internalize domain-specific constraints.

Importantly, this architectural approach enables reliable systems using smaller, locally-deployed models (< 15B parameters) rather than requiring API access to frontier models. By substituting architectural rigor for parameter scale, organizations can maintain control over their software development pipeline while achieving high reliability.

1.2 Theoretical Context and Prior Art

The control dynamics of this framework draw on convergence patterns first encountered in configuration management, where they were pioneered in applied research by CFEngine [1, 2] and subsequently formalized as Promise Theory [3, 4]. In production systems, CFEngine demonstrated that reliability emerges not from commanding distributed components but from continuous convergence toward desired states: autonomous agents make promises rather than receive commands. This practical insight, later abstracted by Burgess into Promise Theory, provides the theoretical lens for managing stochastic systems: treat unreliable components as "promisers" and architect convergence operators to guide them toward validity.

Applied to LLMs, this convergence paradigm operates within the classical agent-environment boundary defined by Sutton & Barto [5], where the agent comprises only components modifiable by the control policy. Since the LLM’s weights cannot be modified during inference, it resides in the environment, with the agent function—per Russell & Norvig’s formalization [6]—mapping percepts (generation outputs) to actions (verification decisions).

Against this theoretical foundation, prior approaches to LLM control generally fall into two categories:

External Control Architectures (Symbolic & Hybrid): Frameworks such as HTN planning [7] and LLM-Modulo [8] maintain deterministic control outside the model. While logically sound, they rely on explicit causal models (preconditions/effects) which are notoriously difficult to extract from the latent space of black-box LLMs.
Internal Control Architectures (LLM-Centric): Conversely, techniques like Chain-of-Thought [9] and ReAct [10] locate the control loop inside the stochastic generation window. While flexible, these methods suffer from probabilistic control flow, where the agent’s decision-making process is subject to the same hallucination modes as its content generation.

This framework synthesizes these perspectives by utilizing external symbolic guards to enforce the internal convergence of generative promises.

1.3 Framework Overview

Building on goal-based agent frameworks [11], this work introduces a mechanism to externalize reasoning. The definition of an action is extended to utilize guard functions—not merely as gates, but as active postcondition validators that project the LLM’s internal generation process onto a verifiable external state.

These sensing actions enable a dual-state architecture that provides:

• Observable Workflow Reasoning: Unlike opaque internal monologues (e.g., Chain-of-Thought), reasoning is captured in explicit state transitions, converting probabilistic generation into deterministic logical steps.
• Bounded Indeterminacy: The system guarantees termination and cost control through deterministic validation predicates and finite retry budgets.
• Atomic Composition: Generation and verification are treated as a single transactional unit (an "Atomic Action Pair"), ensuring that invalid content never pollutes the workflow state.

Figure 1: The Atomic Action Pair. The architecture enforces a strict separation between the Observable Workflow (left) and the Opaque Environment (right). The red loop illustrates the refinement transition: unlike standard backtracking, the workflow state $s_{w}$ remains invariant while guard feedback $\phi$ updates the generative context $C$ . The dotted line indicates that the Guard conditions validation on both the artifact and the context.

2 Definitions

To rigorously formalize the relationship between deterministic workflows and stochastic generators, foundational concepts in agency are distinguished from the specific architectural interpretations employed in this framework.

2.1 Foundational Definitions

Promise Theory (Burgess, 2015): A model of voluntary cooperation where autonomous agents issue promises regarding intended behavior rather than guarantees. Interactions are defined by the consumer’s responsibility to verify promise fulfillment, replacing command-and-control assumptions.
Agent (Russell & Norvig, 1995): A function $f:\mathcal{P}^{*}\to\mathcal{A}$ mapping a complete percept history to actions. A rational agent selects actions that maximize an expected performance measure given its percept sequence and built-in knowledge.
Control Boundary (Sutton & Barto, 1998): The agent comprises only those components that can be arbitrarily modified by the control policy. Components outside this boundary constitute the environment.
Bounded Rationality (Simon, 1955): Rational agents operating under computational constraints do not optimize for the global maximum; instead, they satisfice, selecting the first solution that meets the aspiration level (validity criteria) within the available search budget.
Weak Agency (Wooldridge & Jennings, 1995): A software system exhibiting autonomy, reactivity, and pro-activeness, without implying consciousness or mental states.

2.2 Architectural Definitions

Control Boundary (Generative Application)

This framework applies Sutton & Barto’s definition to the resource-dependent nature of Large Language Models (LLMs). The agent boundary is defined by modifiability relative to the time horizon:

• Intra-Episode: The agent controls context composition ( $C$ ) and state transitions ( $\mathcal{S}_{workflow}$ ).
• Inter-Episode: With sufficient compute, the agent may control adapter parameters (e.g., LoRA) or distilled weights.
• Base Model: The pre-trained weights remain in the environment, providing a stochastic generation oracle that functions as the fixed generative component.

Goal-Based Agent (Deterministic Controller)

A rational decision function $f:\mathcal{P}^{*}\to\mathcal{A}$ that treats stochastic generations as percepts rather than actions. The agent observes the opaque output of the LLM (environment) and executes deterministic state transitions (actions) based on verification results. This ensures that while the content is stochastic, the control flow remains strictly deterministic.

Neuro-Symbolic Agentic System

A software architecture integrating neural generation with symbolic verification. The LLM (as a component of the environment) issues generation promises; deterministic guard functions verify promise fulfillment within an atomic transaction, ensuring that invalid states are never committed to the persistent workflow history.

Dual-State Architecture

An implementation pattern that separates the system state space $\mathcal{S}$ into two distinct spaces:

• $\mathcal{S}_{workflow}$ (Control State): A deterministic, finite state machine tracking goal progress and guard satisfaction.
• $\mathcal{S}_{env}$ (Information State): An append-only versioned repository of generation history, artifacts, and guard feedback, enabling in-context learning without polluting the control flow.

3 Formal Framework

3.1 Dual State Space

Definition 1 (State Space Decomposition).

The system state space $\mathcal{S}$ is decomposed into an observable workflow space and an opaque environment space:

$\mathcal{S}=\mathcal{S}_{\text{workflow}}\times\mathcal{S}_{\text{env}}$ (1)

• Workflow State ( $\mathcal{S}_{\text{workflow}}$ ): Defined as the set of all truth assignments to the guard functions:

$\mathcal{S}_{\text{workflow}}=\{\sigma\mid\sigma:\mathcal{G}\to\{\bot,\top\}\}$ (2)

where $\mathcal{G}=\{g_{1},...,g_{n}\}$ is the set of unique guard identifiers.
• Environment State ( $\mathcal{S}_{\text{env}}$ ): Defined as the Cartesian product of the artifact space and context space:

$\mathcal{S}_{\text{env}}=\mathcal{A}\times\mathcal{C}$ (3)

A specific environment state is denoted as a tuple $s_{\text{env}}=\langle a,C\rangle$ , where $a\in\mathcal{A}$ is the current artifact (mutable result) and $C\in\mathcal{C}$ is the cumulative context (immutable history).

Remark 1 (Information Abstraction).

The workflow state acts as a finite abstraction of execution progress. While the guard function returns detailed feedback $\phi\in\Sigma^{*}$ (e.g., compiler logs), this information is projected into the opaque Context ( $\mathcal{C}$ ). Only the binary satisfaction signal $v\in\{\bot,\top\}$ is retained in $\mathcal{S}_{\text{workflow}}$ , preserving the finiteness of the planning space.
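The decomposition in Definition 1 and the projection in Remark 1 can be illustrated with a minimal Python sketch. The class and function names (`WorkflowState`, `EnvState`, `record_guard_result`) are hypothetical and not part of the framework's specification; the sketch only shows that the boolean verdict reaches the workflow state while the textual feedback $\phi$ is stored on the environment side.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class WorkflowState:
    """Finite control state: the set of guard ids currently mapped to 'top'."""
    satisfied: frozenset = frozenset()

    def mark(self, guard_id: str) -> "WorkflowState":
        # Return a new state; the control state itself is never mutated.
        return WorkflowState(self.satisfied | {guard_id})


@dataclass
class EnvState:
    """Opaque environment state <a, C>: current artifact plus feedback history."""
    artifact: str = ""
    feedback: List[Tuple[str, str]] = field(default_factory=list)


def record_guard_result(s_w, s_env, guard_id, passed, phi):
    """Project a guard result: only the boolean reaches s_w, phi stays in s_env."""
    if passed:
        return s_w.mark(guard_id), s_env
    s_env.feedback.append((s_env.artifact, phi))
    return s_w, s_env
```

In this sketch the workflow state stays hashable and finite regardless of how long the feedback strings become, which is the property Remark 1 relies on.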

3.2 Artifacts, Context, and Provenance

To ensure auditability and enable effective backtracking, the environment is formalized not as a mutable store, but as an append-only versioned repository.

Definition 2 (Artifact Space & Versioning).

Let $\mathcal{A}$ be the set of all possible concrete outputs. A Versioned Repository $\mathcal{R}$ is defined as a Directed Acyclic Graph (DAG) where nodes represent artifact versions $a_{v}$ and edges represent derivation steps.

$\mathcal{R}=\{(a_{0},\dots,a_{k})\mid a_{i}\in\mathcal{A}\}$

Every generative action creates a new node in $\mathcal{R}$ rather than overwriting the previous state. This strictly preserves the failure history (“rejected branches”) for future learning.

Definition 3 (Hierarchical Context Composition).

The context $\mathcal{C}$ conditioning the generator and available to the guard is the composition of three distinct scopes:

$C_{total}=\langle\mathcal{E},C_{local},H_{feedback}\rangle$ (4)

• Ambient Environment ( $\mathcal{E}=\langle\mathcal{R},\Omega\rangle$ ): Contains the Versioned Repository $\mathcal{R}$ (providing read-only access to all finalized ancestor and cousin artifacts) and Global Constraints $\Omega$ .
• Local Context ( $C_{local}=\langle\Psi,a_{k}\rangle$ ): The active scope for the current planning node, containing the Static Specification $\Psi$ (requirements/tests for this specific step) and the Current Artifact $a_{k}$ .
• Feedback History ( $H_{feedback}$ ): The accumulated sequence of guard rejections for this specific node: $H=[(a_{k},\phi_{k}),\dots]$ .

Remark 2 (Context Isolation).

Explicitly separating $\mathcal{E}$ from $H_{feedback}$ ensures that hallucinations or failures in a sub-task do not pollute the global context. When the workflow backtracks, $H_{feedback}$ is cleared and $a_{k}$ is reverted, but the Ambient Environment $\mathcal{E}$ and Specification $\Psi$ remain invariant.
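The three scopes of Definition 3 and the isolation property of Remark 2 can be sketched as follows. The names `Ambient`, `NodeContext`, `reject`, and `backtrack` are illustrative assumptions rather than the framework's API, and artifacts are represented as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Ambient:
    """E = <R, Omega>: read-only view shared by every planning node."""
    repository: Tuple[str, ...]    # finalized artifacts (R)
    constraints: Tuple[str, ...]   # global constraints (Omega)


@dataclass
class NodeContext:
    """C_local = <Psi, a_k> together with H_feedback for one planning node."""
    ambient: Ambient
    spec: str                      # static specification (Psi)
    artifact: str = ""             # current artifact (a_k)
    feedback: List[Tuple[str, str]] = field(default_factory=list)

    def reject(self, artifact: str, phi: str) -> None:
        # A guard rejection extends only the node-local feedback history.
        self.artifact = artifact
        self.feedback.append((artifact, phi))

    def backtrack(self, previous_artifact: str = "") -> None:
        # Clear node-local feedback and revert a_k; E and Psi are untouched.
        self.feedback.clear()
        self.artifact = previous_artifact
```

Because `Ambient` is frozen and `backtrack` touches only node-local fields, a failed sub-task cannot alter what sibling nodes observe.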

3.3 State Evolution Logic

To preserve the finiteness of the planning space while enabling learning, a distinction is made between the Control State (Workflow) and the Information State (Context).

Definition 4 (Workflow Stability).

The workflow state $s_{w}$ is invariant under guard failure. That is, if the guard returns $\bot$ , the control state does not transition:

$T(s_{w},\bot)=s_{w}$

Progress in $\mathcal{S}_{\text{workflow}}$ occurs exclusively upon guard satisfaction ( $T(s_{w},\top)$ ).

Definition 5 (Context Refinement).

While the workflow state remains stable on failure, the context evolves to capture error signal $\phi$ . Let $C_{k}=\langle\Psi,H_{k}\rangle$ be the context at attempt $k$ . The transition is defined as:

$C_{k+1}=\langle\Psi,H_{k}\cup\{(a_{k},\phi_{k})\}\rangle$

This ensures that while the planner remains at the same node, the generator’s conditioning changes monotonically.

3.4 Action Pairs & Preconditions

Definition 6 (Action Pair).

An action is defined as a tuple $A=\langle\rho,a_{gen},G\rangle$ representing the sequence of precondition check, generation, and verification:

• $\rho:\mathcal{S}_{\text{workflow}}\to\{0,1\}$ is the Precondition (Entry Gate). It determines if the action pair is applicable in the current workflow state.
• $a_{gen}:\mathcal{C}\to\mathcal{A}$ is the Generator (Execution). It consumes context to produce an artifact.
• $G:\mathcal{A}\times\mathcal{C}\to\{\bot,\top\}\times\Sigma^{*}$ is the Guard (Exit Gate). It evaluates the artifact against the context to update the state or provide feedback.

Remark 3 (Guard Input Scoping).

While Definition 6 provides guards access to the full context $\mathcal{C}$ (to theoretically allow validation against any historical artifact), in practice, well-designed guards should accept only the minimal required inputs. The Workflow configuration is responsible for extracting specific artifacts from $\mathcal{R}$ (via $\mathcal{E}$ ) and passing them explicitly to the execution runtime, preserving guard simplicity and testability.
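The scoping recommendation in Remark 3 can be made concrete with a short sketch. Here a hypothetical guard, `defines_names`, accepts only the artifact and the one extra input it needs; the workflow configuration binds that input with `functools.partial` instead of handing the guard the full context. The `ActionPair` dataclass and all field names are assumptions introduced for illustration.

```python
import ast
from dataclasses import dataclass
from functools import partial
from typing import Callable, FrozenSet, Tuple

GuardResult = Tuple[bool, str]  # (v, phi)


@dataclass(frozen=True)
class ActionPair:
    precondition: Callable[[FrozenSet[str]], bool]  # rho over the workflow state
    generator: Callable[[str], str]                 # a_gen: context -> artifact
    guard: Callable[[str], GuardResult]             # G, already scoped by the workflow


def defines_names(artifact: str, required: FrozenSet[str]) -> GuardResult:
    """Pass iff the artifact parses and defines every required function/class."""
    try:
        tree = ast.parse(artifact)
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc.msg}"
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.ClassDef))
    }
    missing = sorted(required - defined)
    if missing:
        return False, "missing definitions: " + ", ".join(missing)
    return True, ""


# The workflow configuration extracts what the guard needs and binds it explicitly.
pair = ActionPair(
    precondition=lambda s_w: "syntax" in s_w,
    generator=lambda context: "class LRUCache:\n    pass\n",  # stand-in for an LLM call
    guard=partial(defines_names, required=frozenset({"LRUCache", "main"})),
)
```

A guard written this way can be unit-tested with plain strings, without constructing a repository or a context object.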

Definition 7 (System Dynamics).

The evolution of the full system state $s_{t}=\langle s_{w,t},s_{env,t}\rangle$ (where $s_{env,t}=\langle a_{t},C_{t}\rangle$ and $C_{t}=\langle\Psi,H_{t}\rangle$ ) upon executing action $A$ is defined as:

1. Generation: The generator conditions on the current context $C_{t}$ to produce a new artifact $a^{\prime}$ : $a^{\prime}\sim a_{gen}(C_{t})$
2. Sensing: The guard evaluates the new artifact $a^{\prime}$ within context $C_{t}$ : $\langle v,\phi\rangle=G(a^{\prime},C_{t})$
3. State Update: The next state $s_{t+1}$ is determined by the guard result $v$ .

If $v=\top$ (Advance):

$s_{t+1}=\langle T(s_{w},v),\langle a^{\prime},\langle\Psi,\emptyset\rangle\rangle\rangle$ (5)

If $v=\bot$ (Refine):

$s_{t+1}=\langle s_{w},\langle a^{\prime},\langle\Psi,H_{t}\cup\{(a^{\prime},\phi)\}\rangle\rangle\rangle$ (6)

Definition 8 (Workflow Transition Function).

$T(s_{w},v)=\begin{cases}s_{w}[g_{id}\mapsto\top]&\text{if }v=\top\\ s_{w}&\text{if }v=\bot\end{cases}$
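A single application of Definitions 7 and 8 can be sketched as a pure function. The representation is an assumption made for illustration: the workflow state is a frozenset of satisfied guard ids, the specification $\Psi$ is passed separately because it never changes, and `generate` and `guard` are caller-supplied stand-ins for $a_{gen}$ and $G$.

```python
from typing import Callable, FrozenSet, List, Tuple

Feedback = List[Tuple[str, str]]                 # H: [(artifact, phi), ...]
State = Tuple[FrozenSet[str], str, Feedback]     # (s_w, a, H)


def step(
    state: State,
    spec: str,
    guard_id: str,
    generate: Callable[[str, Feedback], str],
    guard: Callable[[str], Tuple[bool, str]],
) -> State:
    """One generate-then-verify transition of the full system state."""
    s_w, _, history = state
    artifact = generate(spec, history)       # a' ~ a_gen(C_t)
    passed, phi = guard(artifact)            # <v, phi> = G(a', C_t)
    if passed:
        # Advance: set the guard bit and reset the feedback history.
        return (s_w | {guard_id}, artifact, [])
    # Refine: s_w is unchanged, the history grows by one rejection.
    return (s_w, artifact, history + [(artifact, phi)])
```

Calling `step` repeatedly on its own output reproduces the refine loop: the first component changes only on a pass, while the third grows on every rejection.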

3.5 Planning Problem

Definition 9 (Guard-Based Planning Problem).

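The planning problem is a tuple

$\mathcal{P}=\langle\mathcal{S}_{\text{workflow}},\mathcal{A},s_{w0},C_{init},\mathcal{S}_{goal},R_{max}\rangle$

where:

• $\mathcal{A}$ denotes, in this definition, the set of available action pairs (Definition 6), as opposed to the artifact space of Definition 2.
• $s_{w0}\in\mathcal{S}_{\text{workflow}}$ is the initial control state (typically all guards $\bot$ ).
• $C_{init}=\langle\Psi,\emptyset\rangle$ is the initial specification context.
• $\mathcal{S}_{goal}\subseteq\mathcal{S}_{\text{workflow}}$ is the set of satisfying goal states.
• $R_{max}$ is the retry limit, i.e., the maximum number of consecutive guard failures permitted for an action pair.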

A solution is a policy $\pi:\mathcal{S}_{\text{workflow}}\to\mathcal{A}\cup\{\text{wait},\text{term}\}$ that guarantees termination in $\mathcal{S}_{goal}$ within finite steps, subject to the generator capability assumption.

Remark 4 (Complexity Collapse).

Standard planning problems suffer from exponential search space explosion ( $O(B^{K})$ ). However, in industrial workflows where the task topology is fixed (i.e., sequential or DAG), the branching factor $B\approx 1$ . This framework exploits this by converting the Search Problem (finding a path) into a Reliability Problem (ensuring the path is traversable). The complexity is thus dominated by the Retry Limit ( $R_{max}$ ) rather than the workflow depth.

3.6 Execution Semantics

The execution of policy $\pi$ follows a refined control loop that handles context augmentation upon failure, as detailed in Algorithm 1.

Algorithm 1 EXECUTE-PLAN( $\pi,s_{w0},R_{max}$ )

$node\leftarrow\pi.root$
$s_{w}\leftarrow s_{w0}$
$context\leftarrow\text{INITIAL-CONTEXT}()$
$trace\leftarrow[]$
$retries\leftarrow 0$
while $node\neq None$ do
  $A\leftarrow\text{SELECT-ACTION}(node,s_{w})$
  if $retries\geq R_{max}$ then
    return $(trace,s_{w},\text{FAILURE})$
  end if
  $art\leftarrow\text{EXECUTE-GEN}(A.a_{gen},context)$
  $(status,feedback)\leftarrow\text{EVAL-GUARD}(A.G,art,context)$
  $trace.append((s_{w},art,status))$
  if $status=\bot$ then
    $context\leftarrow\text{AUGMENT}(context,art,feedback)$
    $retries\leftarrow retries+1$
    continue
  else
    $s_{w}\leftarrow\text{TRANSITION}(s_{w},status)$
    $context\leftarrow\text{CLEAR-FEEDBACK}(context)$
    $node\leftarrow\text{NEXT-NODE}(node,status)$
    $retries\leftarrow 0$
  end if
end while
return $(trace,s_{w},\text{IS-GOAL}(s_{w}))$
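For a purely sequential plan, Algorithm 1 can be sketched in Python as below. This is an illustrative simplification under stated assumptions, not a reference implementation: SELECT-ACTION and NEXT-NODE collapse into iteration over an ordered list of steps, the context is reduced to a specification string plus a feedback list, and generators and guards are plain callables supplied by the caller.

```python
from typing import Callable, FrozenSet, List, Tuple

Feedback = List[Tuple[str, str]]
Step = Tuple[str, Callable[[str, Feedback], str], Callable[[str], Tuple[bool, str]]]
Trace = List[Tuple[FrozenSet[str], str, bool]]


def execute_plan(steps: List[Step], spec: str, r_max: int) -> Tuple[Trace, FrozenSet[str], bool]:
    """Run each (guard_id, generator, guard) step with at most r_max attempts."""
    s_w: FrozenSet[str] = frozenset()
    trace: Trace = []
    for guard_id, generate, guard in steps:
        history: Feedback = []            # feedback is local to the current node
        retries = 0
        while True:
            if retries >= r_max:
                return trace, s_w, False  # retry budget exhausted: FAILURE
            artifact = generate(spec, history)
            passed, feedback = guard(artifact)
            trace.append((s_w, artifact, passed))
            if passed:
                s_w = s_w | {guard_id}    # TRANSITION; feedback is dropped
                break
            history.append((artifact, feedback))  # AUGMENT
            retries += 1
    goal = frozenset(guard_id for guard_id, _, _ in steps)
    return trace, s_w, s_w == goal        # IS-GOAL
```

Because the retry counter is checked before every generation attempt, the number of generator calls is bounded by `len(steps) * r_max`, matching the bounded-indeterminacy property described in Section 1.3.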

4 Theoretical Results

To establish convergence guarantees, the generator’s competence is formalized as a probability bound.

Assumption 1 (Generator $\epsilon$ -Capability).

It is assumed that for any valid specification context $C$ , the generator has a non-zero probability of producing an artifact that satisfies the guard. Formally, there exists an $\epsilon>0$ such that:

$\forall C,\quad P\left(\text{proj}_{1}(G(a_{gen}(C),C))=\top\right)\geq\epsilon$

where $\text{proj}_{1}$ extracts the boolean status from the guard tuple.

Lemma 1 (Artifact Determinism Sufficiency).

If a generator produces artifact $a$ , then guard evaluation $G(a,C)$ is deterministic regardless of whether the generator itself is deterministic.

Proof.

Since the artifact $a$ is immutable once produced, and $C$ is fixed during the evaluation step, and $G$ is a function mapping inputs to outputs, $G(a,C)$ must yield the same result for repeated evaluations. ∎

Proposition 1 (Workflow State Projection).

The set of guard functions $\mathcal{G}$ defines a deterministic projection $\Gamma:\mathcal{S}_{\text{env}}\to\mathcal{S}_{\text{workflow}}$ . That is, for any opaque environment state $s_{\text{env}}=\langle a,C\rangle$ , there exists exactly one corresponding observable workflow state $s_{w}$ .

Proof.

Let $\mathcal{G}=\{g_{1},g_{2},\dots,g_{N}\}$ be the ordered set of guard functions. By Lemma 1, for a fixed artifact $a\in\mathcal{A}$ and context $C$ , the evaluation of each guard $g_{i}(a,C)$ is deterministic. The evaluation vector $\mathbf{v}\in\{\perp,\top\}^{N}$ is defined such that $\mathbf{v}_{i}=\text{proj}_{1}(g_{i}(a,C))$ . The workflow state $s_{w}$ is defined as the truth assignment mapping $\sigma:\mathcal{G}\rightarrow\{\perp,\top\}$ , which is isomorphic to the vector $\mathbf{v}$ . Since each component $g_{i}(a,C)$ is deterministic, the vector $\mathbf{v}$ (and thus the state $s_{w}$ ) is unique for any given $s_{env}=\langle a,C\rangle$ . Thus, $\Gamma$ is a well-defined function. ∎

Proposition 2 (Finite Search Space via AND-OR Trees).

Let

$\mathcal{S}_{reach}=\{s\in\mathcal{S}_{\text{workflow}}\mid\exists\text{ path from }s_{w0}\text{ to }s\text{ under }T\}$

be the set of reachable workflow states. The planning problem maps to a finite AND-OR tree search over $\mathcal{S}_{reach}$ , where solution existence is decidable.

Proof.

Step 1 (Tree Construction): A search tree $\mathcal{T}$ is constructed, analogous to standard AO* search, where OR-Nodes represent agent decisions $(a_{gen},G)$ and AND-Nodes represent environmental responses $\{pass,fail\}$ .

Step 2 (Finiteness): A ‘fail’ outcome from the Guard action increments the local retry counter $r$ while the workflow state $s_{w}$ remains constant (Definition 4). The generator action is valid (applicable) iff $r<R_{max}$ . Thus, no branch can exceed $R_{max}$ consecutive failures, and the total maximum depth is bounded by $\sum R_{max}$ across all guards.

Step 3 (Decidability): Since the depth is bounded by the retry limit and the set of reachable states $\mathcal{S}_{reach}$ is finite (bounded by the workflow specification), the total tree size $|\mathcal{T}|$ is finite.

∎

Proposition 3 (Generator Independence).

Plan correctness depends only on (a) guard predicate specifications and (b) the generator’s capability to eventually produce valid artifacts, independent of the generator’s internal mechanism.

Proof.

By Definition 7 (System Dynamics), the workflow state transition function $T(s_{w},v)$ depends exclusively on the current observable state $s_{w}$ and the guard result $v$ . The guard function $G(a,C)$ evaluates the artifact $a$ directly, treating the generation process as a black box. While the generator’s internal mechanism determines the probability distribution of $a$ , Assumption 1 guarantees that this distribution has non-zero support for valid artifacts ( $\epsilon>0$ ). Therefore, the logic governing state advancement and goal satisfaction $\mathcal{S}_{goal}$ operates independently of the generator’s internal state space or transition probabilities. ∎

Proposition 4 (Asymptotic Soundness).

Given an $\epsilon$ -capable generator ( $P(pass)\geq\epsilon>0$ ) and a finite contingent plan $\pi$ , the probability of failure approaches 0 as $R_{max}\to\infty$ .

Proof.

Let $p=P(pass)$ be the probability of success for a single attempt. In the worst-case scenario where the generator is memoryless (i.e., it does not learn from the error context), the probability of node failure after $R_{max}$ attempts is $(1-p)^{R_{max}}$ . Since $p\geq\epsilon$ , the upper bound on failure probability is:

$P(\text{fail})\leq(1-\epsilon)^{R_{max}}$ (7)

Since $(1-\epsilon)<1$ :

$\lim_{R_{max}\to\infty}(1-\epsilon)^{R_{max}}=0$

∎

Remark 5 (Conservative Bound).

The proof above assumes a "memoryless" generator (worst-case). In practice, because context $C$ accumulates feedback $\phi$ on every failure (Definition 5), the conditional probability of success typically increases with retries ( $P(pass|C_{k+1})>P(pass|C_{k})$ ). Thus, Inequality (7) is a conservative upper bound on the failure probability and, equivalently, $1-(1-\epsilon)^{R_{max}}$ is a conservative lower bound on per-node reliability.
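The memoryless worst case in Proposition 4 can be checked numerically with a small simulation. The sketch below assumes each attempt passes independently with probability exactly $\epsilon$; the function names and the chosen values of $\epsilon$ and $R_{max}$ are illustrative and are not taken from the experiments.

```python
import random


def node_failure_bound(epsilon: float, r_max: int) -> float:
    """Upper bound (1 - epsilon)^R_max on the failure probability of one node."""
    return (1.0 - epsilon) ** r_max


def simulate_node_failure(epsilon: float, r_max: int, trials: int, seed: int = 0) -> float:
    """Empirical failure rate of a memoryless generator over many trials."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        # The node fails only if every one of the r_max attempts is rejected.
        if all(rng.random() >= epsilon for _ in range(r_max)):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    for r_max in (1, 3, 5, 10):
        bound = node_failure_bound(0.3, r_max)
        observed = simulate_node_failure(0.3, r_max, trials=20_000)
        print(f"R_max={r_max:2d}  bound={bound:.4f}  simulated={observed:.4f}")
```

With a generator that actually uses the accumulated feedback, the simulated rate would be expected to fall below the bound rather than track it.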

Corollary 1 (Reliability Bound).

To achieve a target global reliability $\delta$ (where $0<\delta<1$ ), the minimum retry limit is:

$R_{max}\geq\frac{\ln(1-\delta^{1/K})}{\ln(1-\epsilon)}$ (8)

where $K$ is the number of sequential steps in the workflow.
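Bound (8) follows from Inequality (7) under the additional assumption that the $K$ steps fail independently: requiring $(1-(1-\epsilon)^{R_{max}})^{K}\geq\delta$ gives $(1-\epsilon)^{R_{max}}\leq 1-\delta^{1/K}$ , and taking logarithms (dividing by the negative quantity $\ln(1-\epsilon)$ reverses the inequality) yields (8). The helper below, with the hypothetical name `min_retry_limit`, evaluates the smallest integer retry limit satisfying the bound.

```python
import math


def min_retry_limit(delta: float, epsilon: float, k: int) -> int:
    """Smallest integer R_max with (1 - (1 - epsilon)^R_max)^k >= delta."""
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie strictly between 0 and 1")
    if not (0.0 < epsilon < 1.0):
        raise ValueError("epsilon must lie strictly between 0 and 1")
    if k < 1:
        raise ValueError("k must be a positive integer")
    per_step = delta ** (1.0 / k)  # reliability each step must reach
    return math.ceil(math.log(1.0 - per_step) / math.log(1.0 - epsilon))


def global_reliability(epsilon: float, r_max: int, k: int) -> float:
    """Worst-case probability that all k sequential steps eventually pass."""
    return (1.0 - (1.0 - epsilon) ** r_max) ** k
```

For example, with $\delta=0.9$ , $\epsilon=0.3$ and $K=3$ the helper returns 10: ten retries give a worst-case reliability of about 0.918, while nine give about 0.884.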

Corollary 2 (Complexity Bound).

The worst-case control complexity is linear with respect to the reachable state space size:

$O(|\mathcal{S}_{reach}|\times R_{max}\times|\mathcal{G}|)$

For sparse sequential workflows where $|\mathcal{S}_{reach}|\ll 2^{|\mathcal{G}|}$ , this ensures tractability.

5 Complexity Analysis

Computational complexity is analyzed relative to the Workflow State Space ( $\mathcal{S}_{\text{workflow}}$ ) rather than the Environment State Space ( $\mathcal{S}_{\text{env}}$ ).

5.1 State Space Abstraction

Standard generative planning operates in $\mathcal{S}_{\text{env}}$ , the space of all possible artifacts (e.g., all valid Unicode strings). Since $|\mathcal{S}_{\text{env}}|\approx\infty$ , standard MDP planning algorithms with complexity $O(|S|^{2}|A|)$ are intractable.

Execution is projected into $\mathcal{S}_{\text{workflow}}$ , defined by the boolean status of $N$ guards. While the theoretical size of this space is $2^{N}$ , it is observed that industrial workflows typically follow a strict sequential or Directed Acyclic Graph (DAG) structure.

The Reachable Execution Tree $\mathcal{T}$ is defined as the subset of states reachable from $s_{w0}$ under the transition function $T$ . For such workflows:

$|\mathcal{T}|\ll 2^{N}$ (9)

For a sequential workflow of length $K$ , the reachable space is linear in $K$ .

Remark on Generator Cost: While generator execution is treated as an atomic action in the planning layer, each transition in $\mathcal{S}_{\text{workflow}}$ corresponds to a potentially computationally intensive operation in $\mathcal{S}_{\text{env}}$ (e.g., LLM inference). However, this cost is constant per node visit and does not affect the asymptotic complexity of the search algorithm itself.
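The gap between the reachable space and the theoretical $2^{N}$ can be illustrated by enumerating reachable workflow states for a given dependency structure. The sketch assumes a workflow is described by a mapping from each guard id to the set of guards that must already be satisfied; this encoding and the function name `reachable_states` are illustrative assumptions.

```python
from typing import Dict, FrozenSet, Set


def reachable_states(prereqs: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Enumerate all sets of satisfied guards reachable from the empty state."""
    start: FrozenSet[str] = frozenset()
    seen = {start}
    frontier = [start]
    while frontier:
        state = frontier.pop()
        for guard, needs in prereqs.items():
            # A guard can be newly satisfied only once its prerequisites hold.
            if guard not in state and needs <= state:
                successor = state | {guard}
                if successor not in seen:
                    seen.add(successor)
                    frontier.append(successor)
    return seen


# A strict chain g1 -> g2 -> ... -> g5 versus five mutually independent guards.
chain = {f"g{i}": ({f"g{i - 1}"} if i > 1 else set()) for i in range(1, 6)}
independent = {f"g{i}": set() for i in range(1, 6)}
```

For the five-guard chain the enumeration yields 6 states (one per satisfied prefix), whereas five independent guards reach all 32 truth assignments, which shows that the reduction depends on the workflow topology rather than on the guard count alone.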

5.2 Planning Complexity

Given $N$ guards and a retry limit $R_{max}$ , the planning problem reduces to finding an optimal policy in the finite AND-OR tree $\mathcal{T}$ . The size of this tree is bounded by:

$|\mathcal{T}|\leq N\times R_{max}$

Finding the optimal policy involves a single traversal of this finite tree (e.g., via backward induction), with a computational cost of $O(|\mathcal{T}|)$ . This represents a reduction from infinite/intractable ( $\mathcal{S}_{\text{env}}$ ) to linear/polynomial complexity ( $\mathcal{S}_{\text{workflow}}$ ) with respect to the workflow length, conditioned on the assumption that the workflow structure is sparse.

6 Experimental Validation

6.1 Illustrative Validations

The primary contribution of this work is the formal Dual-State Framework and the concept of Atomic Action Pairs. To validate this architecture, a set of Diagnostic Probes was selected rather than broad industry benchmarks such as HumanEval or SWE-Bench.

This decision was driven by two factors. First, because this work was conducted by an independent researcher with limited computational resources (time and GPU availability), the goal was to isolate the architectural mechanism efficiently rather than to benchmark models at scale. Second, and more critically, current benchmarks are not designed to evaluate agentic control flow in the context of formal verification.

For instance, while SWE-Bench provides realistic software engineering tasks, applying the framework would require a pre-existing library of Guard Actions (formal specifications or executable tests) mapped to each repository issue. Constructing such a “Guard Library” is a non-trivial engineering challenge and represents a distinct avenue for future research (potentially involving a meta-policy that selects guards dynamically).

Consequently, this experimental validation is illustrative. It is designed to answer a single mechanistic question: Does the introduction of deterministic guards enable a stochastic model to solve problems it could not solve zero-shot?

To this end, three tasks were selected (see Appendix C). It is hypothesized that while the models possess high prior knowledge (“High Priors”) for the concepts behind all three, they differ significantly in implementation difficulty:

• LRU Cache: The model is expected to have High Priors for both the concept and the implementation. The pattern (Hash Map + Linked List) is standard in training data. The challenge is simply maintaining state without drift.
• Template Engine: The model is expected to have High Priors in Concept (knowing what a template engine is), but Low Priors in Implementation. Since there is no single standard way to write a parser from scratch, the model must synthesize a novel solution rather than recalling a memorized one.
• Password Validator: The model is expected to have High Priors in Concept but faces a Calculation Gap. While the rules are simple to state, satisfying them requires mathematical operations (e.g., Prime Number calculation) that are difficult for token-prediction models to implement correctly.

In choosing these tasks, specific Atomic Action Pairs are isolated to validate that deterministic guards can enforce reliability on a single stochastic generation node. Data was collected to test two primary hypotheses:

• H1 (Reliability): The dual-state, guard-based planning framework significantly increases task success rates compared to a standard, single-attempt baseline.
• H2 (Efficiency): The increase in reliability is achieved with modest additional generation attempts, and this efficiency varies predictably with model scale.

Task	Prior Knowledge vs. Implementation Gap	Guard Role	Expected Behavior
LRU Cache	High Prior / Standard Implementation. The model likely memorized this pattern during training.	Drift Prevention. The guard ensures the model doesn’t make careless errors (hallucinations) in a known pattern.	High Baseline Success. It is expected for competent models to solve this easily; the guard mainly fixes minor slip-ups.
Template Engine	High Concept / Novel Implementation. The model knows the concept but must invent the specific logic (parsing) on the fly.	Syntax Guide. The guard provides error messages that help the model fix its specific implementation choices.	Optimization. The model is expected to start with broken code and iterate toward a working solution using the feedback.
Password Validator	High Concept / Calculation Gap. The model knows the rules but is unable to calculate the math (Primes) required to satisfy them.	Hard Gating. The guard rejects invalid math.	Low success. This is expected to be the hardest task because knowing the definition of a prime number doesn’t help the model calculate one.

Table 1: Classification of Diagnostic Probes. These tasks were selected to test how the framework handles different types of “Implementation Gaps,” ranging from simple recall (LRU) to complex mathematical constraints (Password Validator).

Limitations of this Approach: It is acknowledged that these tasks are bounded and synthetic. However, restricting the problem space reduces confounding variables such as library knowledge or prompt ambiguity, allowing success to be attributed more directly to the architectural intervention (the Guard mechanism).

6.2 Model Qualification

A foundational premise of this work is that the Dual-State Framework is an architectural multiplier for capability, not a substitute for it. The framework requires a generator capable of instruction following—specifically, the ability to ingest a guard’s error trace and attempt a semantic correction.

To isolate this architectural effect, a model is deemed qualified if it consistently produces parsable output in a zero-shot prompt. Models that fail this basic threshold (e.g., raw fill-in-the-middle (FIM) models, or those producing blank tokens) have an effective success probability of $\epsilon\approx 0$ and therefore violate Assumption 1. Including them would conflate architectural failure with model incapacity.

Specifically, the qualification threshold requires the model to adhere to the requested output format (Markdown code blocks) under a generic system prompt. Models that produce valid code but fail to wrap it in standard formatting (e.g., Markdown backticks) are classified as unqualified, as they fail the fundamental agentic requirement of Interface Compliance.

6.3 Experimental Setup

• Workflow: The workflow for all tasks enforces structural correctness via a sequential Guard Validation Chain:
  1. Generation: The probabilistic output from the LLM.
  2. Syntax Validation: Enforced by SyntaxGuard (validating Python AST parsing).
  3. Functional Correctness: Enforced by TestGuard (validating functional correctness via unit tests).
  Each guard failure triggers the refinement loop ( $s_{w}\to s_{w}$ ), preventing transition to the next logical state until the constraint is met.
• Models & Runtime: 13 models from 6 families were evaluated (see Table 2). All models were executed locally using Ollama v0.12.3 to ensure a standardized, offline inference environment. The focus is deliberately on Small Language Models (SLMs) ( $<$ 15B parameters) to test the hypothesis that architectural constraints can substitute for parameter scale. Demonstrating high reliability on these lightweight models validates the framework’s ability to act as a capability multiplier, enabling secure, local execution for SDLC tasks without reliance on massive, proprietary APIs.
• Hardware Specification: Experiments were conducted on a workstation equipped with an AMD Threadripper PRO (32 cores), 125GB RAM, and an NVIDIA RTX A4000 GPU. This setup accommodated the varying VRAM requirements of models ranging from 1.3B to 15B parameters without quantization loss beyond the standard 4-bit (q4_k_m) schema.
• Inference Parameters: To verify the architecture’s ability to manage high variance, all models were sampled with temperature $T=0.7$ . This relatively high entropy setting ensures the generator is sufficiently "irrational," testing the framework’s ability to constrain a stochastic process.
• Software Harness: The control loop was implemented in Python 3.12.11. To strictly isolate the "Control" from the "Generation," the harness enforces:
  – Context Isolation: A hard reset of the inference context window between trials.
  – Template Normalization: A unified system prompt is applied across all models, intentionally avoiding model-specific chat templates. This acts as a stress test for "Instruction Following Robustness": models that rely on bespoke control tokens rather than natural language instructions fail the qualification step.
• Guards ( $G$ ): A single comprehensive guard was implemented, $G_{\text{functional}}$ , which encapsulates:
  1. Execution Safety: Controlled process execution with a 60s timeout to prevent infinite loops.
  2. Functional Correctness: A suite of task-specific unit tests against the generated artifact.
  3. Diagnostic Feedback: If failure occurs, the guard returns specific error traces to guide retries.
• Configurations: Two execution modes are compared:
  – Baseline (One-Shot): A single generation attempt ( $k=1$ ), measuring the model’s raw zero-shot capability.
  – Guarded (Agentic): The contingent planner with a retry limit of $R_{\text{max}}=3$ . This mode utilizes the guard’s diagnostic feedback to iteratively refine the artifact upon failure.
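The setup above can be illustrated with a minimal sketch of one guarded action pair. This is not the paper's harness: the names `GuardResult`, `syntax_guard`, `make_test_guard`, `run_action_pair`, and `stub_generator` are hypothetical, the generator is a deterministic stub standing in for an LLM call, and the artifact is executed in-process without the 60s timeout or process isolation described above. Whether $R_{\text{max}}$ counts retries or total attempts is also an assumption; the sketch allows one initial attempt plus up to `r_max` retries.

```python
import ast
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GuardResult:
    passed: bool
    feedback: str = ""


Guard = Callable[[str], GuardResult]


def syntax_guard(source: str) -> GuardResult:
    """Pass if the artifact parses as Python (AST check)."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return GuardResult(False, f"SyntaxError: {exc.msg} (line {exc.lineno})")
    return GuardResult(True)


def make_test_guard(check: Callable[[dict], None]) -> Guard:
    """Build a guard that executes the artifact and runs assertion-based checks.

    NOTE: no sandbox or timeout here; a real harness would isolate execution.
    """
    def guard(source: str) -> GuardResult:
        namespace: dict = {}
        try:
            exec(source, namespace)
            check(namespace)
        except Exception as exc:
            return GuardResult(False, f"{type(exc).__name__}: {exc}")
        return GuardResult(True)
    return guard


def run_action_pair(
    generate: Callable[[list[str]], str],
    guards: list[Guard],
    r_max: int = 3,
) -> tuple[Optional[str], int]:
    """Generate, validate through the guard chain, and retry with feedback."""
    context: list[str] = []  # accumulated guard feedback
    for retries in range(r_max + 1):
        artifact = generate(context)
        failure = None
        for guard in guards:  # sequential chain: stop at the first failing guard
            result = guard(artifact)
            if not result.passed:
                failure = result
                break
        if failure is None:
            return artifact, retries
        context.append(failure.feedback)  # context refinement
    return None, r_max


def stub_generator(context: list[str]) -> str:
    """Hypothetical stand-in for an LLM: improves as feedback accumulates."""
    drafts = [
        "def add(a, b) return a + b",        # syntax error
        "def add(a, b):\n    return a - b",  # parses, wrong logic
        "def add(a, b):\n    return a + b",  # correct
    ]
    return drafts[min(len(context), len(drafts) - 1)]


def check_add(namespace: dict) -> None:
    assert namespace["add"](2, 3) == 5, "add(2, 3) should equal 5"


artifact, retries = run_action_pair(
    stub_generator, [syntax_guard, make_test_guard(check_add)]
)
```

In this toy run the first draft is rejected by the syntax guard, the second by the test guard, and the third is accepted, so the workflow state advances only after both guards pass.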

Table 2: Code generation models selected for the experiment, categorized by approximate size.

Category	Models
Large (9B+)	Qwen2.5-Coder (14B), StarCoder2 (15B), Phi4 (14B)
Medium (4-8B)	Yi-Coder (9B), Granite-Code (8B), Qwen2.5-Coder (7B), CodeGemma (7B), DeepSeek-Coder (6.7B)
Small (2-4B)	Qwen2.5-Coder (3B), Granite-Code (3B), Phi4-Mini (3.8B)
Tiny ( $<$ 2B)	Qwen2.5-Coder (1.5B), CodeGemma (2B), Yi-Coder (1.5B), DeepSeek-Coder (1.3B)

6.4 Results

Fifty independent trials were executed for each model on each of the three diagnostic probes. To quantify the architectural benefit of the Dual-State Framework, three quantities are reported: the Baseline Success (One-Shot, $k=1$ ), the Guarded Success ( $R_{max}=3$ ), and the Reliability Gain ( $\Delta$ ), the absolute percentage-point improvement attributable to the guard mechanism.

Table 3: Template Engine Results. High conceptual priors but structural implementation gaps. The table is sorted by Reliability Gain ( $\Delta$ ) to show which models most effectively utilized the guard as a “Syntax Guide.” Statistical significance: *** $p<0.001$ , ** $p<0.01$ , * $p<0.05$ (Fisher’s exact test).

Model	Base ( $k=1$ )	Guarded ( $R=3$ )	Gain ( $\Delta$ )	Avg Retries
Yi-Coder (9B)	56%	98%	+42***	0.92
StarCoder2 (15B)	60%	100%	+40***	0.32
Qwen2.5-Coder (3B)	8%	42%	+34***	2.72
Qwen2.5-Coder (7B)	70%	98%	+28***	0.38
Granite-Code (8B)	50%	76%	+26*	1.46
Phi4 (14B)	8%	26%	+18*	2.26
Qwen2.5-Coder (14B)	86%	100%	+14*	0.20
DeepSeek-Coder (6.7B)	4%	18%	+14	3.10
Unqualified ( $\epsilon=0$ )*	0–2%	0–6%	$\approx 0$	$>3.5$

*DeepSeek-Coder (1.3B), Phi4-Mini (3.8B), Yi-Coder (1.5B), Qwen2.5-Coder (1.5B)

Table 4: LRU Cache Results. High priors for both concept and implementation. The table is sorted by Reliability Gain ( $\Delta$ ), illustrating how guards close the reliability gap for mid-sized models (e.g., DeepSeek-Coder 6.7B) by catching stochastic drift. Statistical significance: *** $p<0.001$ , ** $p<0.01$ , * $p<0.05$ (Fisher’s exact test).

Model	Base ( $k=1$ )	Guarded ( $R=3$ )	Gain ( $\Delta$ )	Avg Retries
DeepSeek-Coder (6.7B)	48%	98%	+50***	0.76
Granite-Code (8B)	60%	98%	+38***	0.52
Yi-Coder (1.5B)	62%	98%	+36***	0.76
Qwen2.5-Coder (1.5B)	74%	98%	+24***	0.22
Granite-Code (3B)	80%	98%	+18**	0.38
StarCoder2 (15B)	86%	100%	+14*	0.10
Qwen2.5-Coder (7B)	92%	100%	+8	0.06
Qwen2.5-Coder (3B)	96%	100%	+4	0.18
Qwen2.5-Coder (14B)	98%	100%	+2	0.00
Phi4 (14B)	100%	100%	–	0.00
Yi-Coder (9B)	100%	100%	–	0.04
Phi4-Mini (3.8B)	60%	58%	-2	1.80
DeepSeek-Coder (1.3B)	0%	0%	–	3.94

Table 5: Password Validator Results. High conceptual priors but severe calculation gaps. The table is sorted by Reliability Gain ( $\Delta$ ) to highlight the architectural impact on capable but arithmetically limited models (e.g., StarCoder2). Statistical significance: *** $p<0.001$ , ** $p<0.01$ , * $p<0.05$ (Fisher’s exact test).

Model	Base ( $k=1$ )	Guarded ( $R=3$ )	Gain ( $\Delta$ )	Avg Retries
StarCoder2 (15B)	0%	66%	+66***	0.84
DeepSeek-Coder (6.7B)	50%	96%	+46***	0.72
Granite-Code (3B)	36%	80%	+44***	1.46
Qwen2.5-Coder (1.5B)	14%	52%	+38***	1.88
Granite-Code (8B)	58%	94%	+36***	0.92
Yi-Coder (9B)	76%	100%	+24***	0.38
Yi-Coder (1.5B)	24%	36%	+12	2.54
Qwen2.5-Coder (7B)	90%	100%	+10	0.32
Qwen2.5-Coder (3B)	92%	100%	+8	0.28
Phi4 (14B)	98%	100%	+2	0.06
Phi4-Mini (3.8B)	0%	0%	–	2.18
DeepSeek-Coder (1.3B)	0%	0%	–	3.92

Note: Qwen2.5-Coder (14B) was excluded from this task due to data corruption during logging.

Figure 2: Guarded Success Rate by Model and Task. Heatmap showing task-specific model qualification under the guarded configuration ( $R_{max}=3$ ). Green cells indicate high reliability ( $\geq$ 90%), yellow indicates partial success, and red indicates failure to converge. Notable patterns: DeepSeek-Coder (1.3B) shows uniform failure across all tasks ( $\epsilon=0$ ), establishing the canonical “unqualified” baseline; Phi4-Mini exhibits task-specific qualification (58% LRU, 0% password); the Qwen2.5-Coder (14B) password anomaly (0%) reflects data corruption rather than model capability.

6.5 Analysis

The expanded benchmark across 13 models (ranging from 1.3B to 15B parameters) reveals a nuanced capability landscape. Statistical significance was assessed using Fisher’s exact test, with effect sizes reported as Cohen’s h for proportions.
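As a worked illustration of the two statistics named above, the following sketch computes Cohen's $h$ and a two-sided Fisher's exact test using only the Python standard library. The function names are hypothetical, and the paper does not state which implementation was used; the two-sided convention assumed here sums the probabilities of all tables no more likely than the observed one.

```python
from math import asin, comb, sqrt


def cohens_h(p_base: float, p_guarded: float) -> float:
    """Effect size for two proportions via the arcsine transformation."""
    return 2 * asin(sqrt(p_guarded)) - 2 * asin(sqrt(p_base))


def fisher_exact_two_sided(succ_a: int, succ_b: int, n_a: int, n_b: int) -> float:
    """Two-sided Fisher's exact test on a 2x2 success/failure table.

    Works with exact integer hypergeometric weights, so there is no
    floating-point comparison between table probabilities.
    """
    total_succ = succ_a + succ_b

    def weight(k: int) -> int:
        return comb(n_a, k) * comb(n_b, total_succ - k)

    lo = max(0, total_succ - n_b)
    hi = min(n_a, total_succ)
    observed = weight(succ_a)
    extreme = sum(weight(k) for k in range(lo, hi + 1) if weight(k) <= observed)
    return extreme / comb(n_a + n_b, total_succ)


# Yi-Coder (9B) on the template task: 56% -> 98%
h_yi = cohens_h(0.56, 0.98)

# DeepSeek-Coder (6.7B) on the template task: 2/50 -> 9/50 successes
p_deepseek = fisher_exact_two_sided(2, 9, 50, 50)
```

Under these assumptions the two calls agree with the values reported in this section ( $h=1.17$ for Yi-Coder (9B) and $p=0.051$ for DeepSeek-Coder (6.7B) on the template task).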

Template Engine (Structural Gap): This task exhibited the widest variance in guard effectiveness. Top performers achieved substantial gains: Yi-Coder (9B) improved from 56% to 98% ( $\Delta=+42\text{pp}$ , $p<0.001$ , Cohen’s $h=1.17$ ), while StarCoder2 (15B) reached perfect reliability from a 60% baseline. The template task proved most discriminating for smaller models: DeepSeek-Coder (6.7B) achieved only 18% guarded success ( $\Delta=+14\text{pp}$ , $p=0.051$ ), suggesting that instruction-following fidelity limits how effectively feedback can be utilized. The four unqualified models (1.3B–3.8B; see Table 3) showed negligible improvement ( $\epsilon\approx 0$ ), establishing a clear capability threshold.

LRU Cache (Drift Prevention): The LRU task confirmed the framework’s efficiency for well-understood patterns. Eleven of thirteen models achieved $\geq$ 98% guarded success. Notable findings include:

• DeepSeek-Coder (6.7B) showed the largest gain ( $\Delta=+50\text{pp}$ , $p<0.001$ , Cohen’s $h=1.33$ ), demonstrating that guards effectively close the reliability gap for mid-capability models.
• Phi4-Mini (3.8B) exhibited anomalous behavior: a negative gain (-2pp) with high retry costs (1.80 avg), suggesting possible overfitting to feedback or instruction-following degradation under error correction.
• DeepSeek-Coder (1.3B) achieved 0% across both configurations, establishing the canonical “unqualified” ( $\epsilon=0$ ) model baseline.

Password Validator (Reasoning Gap): This task exposed a reasoning capability threshold that correlates weakly with parameter count. Phi4 (14B) achieved 98% baseline, while StarCoder2 (15B) achieved 0%. The guards proved transformative for StarCoder2: from 0% baseline to 66% guarded ( $\Delta=+66\text{pp}$ , $p<0.001$ , Cohen’s $h=1.90$ ), the largest effect size observed. This suggests guards can bootstrap reasoning in models that understand the structure but fail on computation. The smallest models showed the capability threshold clearly: Qwen2.5-Coder (1.5B) reached 52% guarded from 14% baseline, while DeepSeek-Coder (1.3B) and Phi4-Mini (3.8B) remained at 0%.

Cost-Benefit Analysis: Across all valid trials, the framework demonstrates an efficiency advantage over standard “Best-of-N” sampling. A comparable Pass@5 strategy incurs a fixed 5.0 $\times$ compute cost. In contrast, the sequential refinement strategy achieves reliable convergence with an average cost of just 1.2–1.6 $\times$ for qualified models. The cost-benefit ratio (percentage-point gain per compute multiplier) was highest for mid-sized models: StarCoder2 (15B) on password achieved +35.9 pp/ $\times$ , while Qwen2.5-Coder (7B) on template achieved +20.3 pp/ $\times$ .
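The cost-benefit ratio can be reproduced with a short calculation. The sketch below assumes the compute multiplier equals one baseline generation plus the average number of retries; this definition is inferred from the reported figures rather than stated in the text, and the function name is hypothetical.

```python
def gain_per_compute(baseline_pct: float, guarded_pct: float, avg_retries: float) -> float:
    """Percentage-point gain divided by the compute multiplier.

    Assumption: multiplier = 1 (initial attempt) + average retries.
    """
    multiplier = 1.0 + avg_retries
    return (guarded_pct - baseline_pct) / multiplier


# Values taken from Table 5 (StarCoder2, password) and Table 3 (Qwen2.5-Coder 7B, template)
starcoder_password = gain_per_compute(0, 66, 0.84)
qwen7b_template = gain_per_compute(70, 98, 0.38)

# Fixed multiplier quoted in the text for a Pass@5 strategy
PASS_AT_5_MULTIPLIER = 5.0
```

With this definition, StarCoder2's multiplier is 1.84 and Qwen2.5-Coder (7B)'s is 1.38, both well below the fixed 5.0 of a Pass@5 strategy.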

Key Insight—Task-Specific Qualification: A critical finding is that model qualification ( $\epsilon>0$ ) is task-specific, not global (see Figure 2). Phi4-Mini (3.8B) is qualified for LRU (60% baseline) but unqualified for password (0%). This has practical implications: guard-based systems should assess model capability per-task rather than assuming uniform competence.

6.6 TDD Workflow Benchmark

The preceding experiments validated single Atomic Action Pairs. To test multi-step workflows, a TDD pipeline was constructed where the output of one action pair becomes input to the next:

1. g_test: Generate pytest test functions from specification (validated by SyntaxGuard)
2. g_impl: Generate implementation that passes the generated tests (validated by DynamicTestGuard)

This creates a practical complication: the implementation must satisfy LLM-generated tests, not human-written ones. Specification errors in the first step propagate to the second.
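The propagation risk can be made concrete with a small sketch. It is a simplified stand-in for the pipeline, not the benchmark code: `syntax_guard` and `dynamic_test_guard` are hypothetical functions, the generated tests are plain `test_*` functions run in-process rather than through pytest, and the "LLM-generated" strings are hard-coded.

```python
import ast


def syntax_guard(source: str) -> tuple[bool, str]:
    """Step 1 guard: accept any test file that parses."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc.msg}"
    return True, ""


def dynamic_test_guard(impl_source: str, test_source: str) -> tuple[bool, str]:
    """Step 2 guard: run the generated tests against the implementation."""
    namespace: dict = {}
    try:
        exec(impl_source, namespace)   # no sandboxing in this sketch
        exec(test_source, namespace)
        for name, obj in list(namespace.items()):
            if name.startswith("test_") and callable(obj):
                obj()
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


correct_impl = (
    "class Stack:\n"
    "    def __init__(self):\n"
    "        self.items = []\n"
    "    def push(self, x):\n"
    "        self.items.append(x)\n"
    "    def pop(self):\n"
    "        return self.items.pop()\n"
)

# A syntactically valid but semantically wrong test: it expects FIFO order.
flawed_tests = (
    "def test_push_pop():\n"
    "    s = Stack()\n"
    "    s.push(1)\n"
    "    s.push(2)\n"
    "    assert s.pop() == 1\n"
)

sound_tests = flawed_tests.replace("== 1", "== 2")

tests_accepted, _ = syntax_guard(flawed_tests)
impl_accepted, feedback = dynamic_test_guard(correct_impl, flawed_tests)
```

Here the flawed tests pass the first guard because it checks only syntax, and the correct implementation is then rejected by the second guard, which is exactly the compounding effect described above.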

Six tasks were selected across difficulty tiers: Stack and Queue (basic data structures), Calculator and LRUCache (state management), SimpleTemplate (string parsing), and PasswordValidator (exact error message matching). Three Qwen2.5-Coder variants (3B, 7B, 14B) were tested across 50 trials each with $R_{max}=3$ .

Table 6: TDD Workflow success rates across 50 trials per model-task pair.

Model	Stack	Queue	Calc.	LRU	Templ.	Pass.	Overall
Qwen2.5-Coder (14B)	88%	98%	84%	66%	68%	14%	70%
Qwen2.5-Coder (7B)	94%	98%	78%	50%	20%	0%	57%
Qwen2.5-Coder (3B)	68%	74%	68%	4%	2%	0%	36%

Table 7: TDD Workflow efficiency: average attempts and duration per trial.

Average Attempts
Model	Stack	Queue	Calc.	LRU	Templ.	Pass.
Qwen2.5-Coder (14B)	2.4	2.1	2.5	3.1	3.6	4.6
Qwen2.5-Coder (7B)	2.2	2.1	2.7	3.7	4.6	5.0
Qwen2.5-Coder (3B)	3.2	2.9	3.0	4.9	5.0	5.0
Average Duration (seconds)
Qwen2.5-Coder (14B)	19.5	15.7	16.8	43.3	33.8	42.3
Qwen2.5-Coder (7B)	6.9	6.7	8.4	26.3	25.1	29.1
Qwen2.5-Coder (3B)	9.2	8.8	7.4	34.5	19.3	26.2

The results confirm expected patterns: model scale correlates with success (70% for 14B vs. 36% for 3B), and task difficulty creates clear tiers. Easy tasks achieve 71–98% success; hard tasks drop to 0–14% for PasswordValidator.

The PasswordValidator results warrant attention. Even the 14B model achieves only 14% success—far below its single-task performance. Examining the failure artifacts reveals why: LLM-generated tests frequently contain incorrect edge case expectations (e.g., wrong error message ordering for multi-violation inputs). The implementation then fails not because it is wrong, but because it must satisfy a flawed specification.

This illustrates the risk identified in Remark 7: when LLM-generated artifacts become validators, specification errors compound through the workflow. The practical mitigation is straightforward—insert a HumanGuard checkpoint between test generation and implementation to catch specification errors before they propagate.

The framework supports this; the benchmark simply omitted it to measure the failure mode. Far from being an experimental flaw, this failure mode validates the central thesis of the framework: a stochastic generator cannot serve as its own ground-truth oracle. Without an external source of truth (a human, a formal spec, or a deterministic compiler), the agent creates a closed feedback loop of hallucination.

7 Limitations

While the Dual-State Framework provides rigorous guarantees for generative workflows, it is not a panacea. Five key limitations are identified that define the boundaries of its applicability:

• Guard Design Overhead & Correctness: The framework shifts the burden of correctness from the stochastic prompt to the deterministic guard. This introduces a "Guard Design" bottleneck: the agent is only as reliable as the guard function itself. Furthermore, not all domains are easily formalizable; while syntax and functional correctness are verifiable, subjective qualities (e.g., "UI aesthetics" or "UX (User eXperience)") remain difficult to capture in deterministic predicates.
• Generator Capability Threshold ( $\epsilon>0$ ): Theoretical convergence relies on the assumption that the generator has a non-zero probability ( $\epsilon$ ) of producing a valid artifact. As observed in the experiments with models under 3B parameters, this assumption does not hold for unqualified models. The framework cannot "fix" a model that fundamentally lacks the reasoning capacity to understand the task or the guard’s feedback.
• Latency & Computational Cost: By definition, the refinement loop introduces latency. A 2.1 $\times$ computational overhead, while acceptable for asynchronous software development tasks, may be prohibitive for real-time applications requiring millisecond responsiveness.
• Context Window Saturation: The Context Refinement mechanism ( $C_{k+1}=C_{k}\cup\phi$ ) relies on appending error traces to the history. For extremely complex failures or high retry limits, this can saturate the context window of the LLM, potentially degrading performance or incurring significant token costs.
• Specification Brittleness: The framework assumes a static specification $\Psi$ . In highly exploratory domains where the requirements themselves are fluid or discovered during execution, the rigid pre-definition of guards may constrain the agent’s ability to find novel, out-of-distribution solutions.

8 Future Research

8.1 Autonomous Calibration of Latent Specifications

While Appendix F outlines a practical workflow for bootstrapping legacy systems[12], this process represents a distinct class of theoretical control problems: Specification Extraction via Oracle Inversion. Unlike standard generation where the specification is static and explicit ( $C_{\text{init}}$ ), legacy environments possess a latent specification encoded purely in binary execution behavior.

Future research should investigate the convergence properties of agents operating in this “Oracle Inversion” regime. Specifically, can the Dual-State Architecture guarantee that an agent’s set of generated characterization guards ( $G_{\text{char}}$ ) asymptotically approaches the true semantic boundaries of the legacy artifact? By modeling the “Bootstrapping Phase” as a System Identification task, we can theoretically bound the number of “sensing actions” (guard executions) required to achieve a target confidence level in the generated regression suite, effectively transforming “Legacy Refactoring” from an art into a measurable, convergent algorithmic process.

8.2 Continuous Learning via The Optimization Loop

The standard execution model treats retries as computational waste. This section proposes converting that overhead into a training signal by closing the loop between three distinct entities: the Guard (the critic), the Coach (the guide), and the Generator (the actor). This creates a four-tier optimization hierarchy:

8.2.1 Tier 1: Immediate Correction (The Coach)

While Guards must remain deterministic to preserve safety guarantees, the feedback mechanism benefits from the semantic reasoning of large language models. The Action Pair is formally extended into an Extended Action Tuple:

$$A^{+}=\langle\rho,a_{gen},G,a_{coach}\rangle$$

Here, $a_{coach}$ acts as a "Probabilistic Heuristic" or an internal "LLM-as-a-Judge." When the Guard $G$ fails, the Coach analyzes the binary failure signal and the artifact to produce a semantic refinement $\Delta C$ :

$$c_{new}\sim a_{coach}(s_{\text{env}},c_{old},\phi_{guard})$$

This decouples Safety (enforced by the deterministic Guard) from Liveness (promoted by the probabilistic Coach), allowing the agent to recover from failures using semantic feedback.

8.2.2 Tier 2: Sparse Reward Signal (The Critic)

Since guards provide ground-truth validity signals, they function as a trustworthy, albeit sparse, reward function for Reinforcement Learning (RL).

Definition 10 (Sparse Safety Reward).

A reward function $\mathcal{R}_{sparse}:\mathcal{S}_{\text{env}}\times\mathcal{G}\to\mathbb{R}$ is defined where:

$$\mathcal{R}_{sparse}(s_{\text{env}},G)=\begin{cases}+1&\text{if }G(s_{\text{env}})=\top\\ -1&\text{if }G(s_{\text{env}})=\bot\end{cases}$$

Remark 6 (The Maze Isomorphism).

This formulation draws a direct parallel to classical Q-Learning in grid-world environments. Just as a maze solving agent learns to avoid walls through negative rewards ( $r_{wall}<0$ ) while seeking the goal state [5], the Neuro-Symbolic system treats Logic Guards as “semantic walls.” The optimization loop thus effectively maps the high-dimensional, opaque manifold of the LLM onto a navigable, reward-driven maze, allowing standard RL techniques to optimize the agent’s trajectory away from invalid regions.

8.2.3 Tier 3: Dense Reward Signal (The Shaping)

While the Guard provides ground truth, the signal is sparse (binary). The Coach supplements this with a Dense Reward based on its semantic evaluation of the "distance" to the solution.

$$\mathcal{R}_{dense}(s_{\text{env}},a_{coach})\in[0,1]$$

This acts as a Reward Shaping mechanism. Even if an artifact fails the Guard (Sparse Reward = -1), the Coach may assign a high Dense Reward if the logic was "almost correct" (e.g., correct algorithm but wrong syntax). This allows the Generator to improve incrementally even within invalid regions of the search space.
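The two reward tiers can be sketched side by side. In this illustration the Guard is a simple parse check and the Coach is replaced by a string-similarity score against a reference solution; that similarity heuristic is a hypothetical stand-in chosen so the example runs without an LLM, and it is not the semantic evaluation the text envisions. All function names are assumptions.

```python
import ast
from difflib import SequenceMatcher
from typing import Callable


def parses(artifact: str) -> bool:
    """Deterministic guard: does the artifact parse as Python?"""
    try:
        ast.parse(artifact)
        return True
    except SyntaxError:
        return False


def sparse_reward(artifact: str, guard: Callable[[str], bool]) -> int:
    """Definition 10: +1 if the guard passes, -1 otherwise."""
    return 1 if guard(artifact) else -1


def dense_reward(artifact: str, coach_score: Callable[[str], float]) -> float:
    """Coach score clamped to [0, 1]."""
    return min(1.0, max(0.0, coach_score(artifact)))


REFERENCE = "def add(a, b): return a + b"


def similarity_coach(artifact: str) -> float:
    """Hypothetical stand-in for the Coach: textual closeness to a reference."""
    return SequenceMatcher(None, artifact, REFERENCE).ratio()


almost_correct = "def add(a, b) return a + b"  # right logic, missing colon

r_sparse = sparse_reward(almost_correct, parses)
r_dense = dense_reward(almost_correct, similarity_coach)
```

The near-miss artifact receives the full sparse penalty but a high dense score, which is the gradient-like signal that reward shaping is meant to supply inside invalid regions.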

8.2.4 Tier 4: Policy Distillation (The Update)

To minimize the expected runtime cost ( $E[retries]\to 0$ ), successful traces are utilized to fine-tune the generator $\pi_{\theta}$ . A refinement episode yields a trace $\tau=(c_{0},a_{fail},\phi_{dense},\Delta C,a_{success})$ .

The eventual success $a_{success}$ is treated as the target label, but the update is also conditioned on the Coach’s feedback $\phi_{dense}$ . This encourages the model not just to memorize the answer, but to internalize the reasoning process (the feedback) that led to it:

$$\mathcal{L}(\theta)=-\mathbb{E}_{\tau}[\log\pi_{\theta}(a_{success}\mid c_{0},\phi_{dense})]$$

This process effectively "compiles" the runtime reasoning loop—including the Coach’s guidance—into the model’s weights.

8.3 Dynamic Guarding: Meta-Policy Optimization

While this work formalizes the Guard as a fixed component of an Atomic Action Pair, future iterations can treat the Guard function $G$ as a distinct member of the agent’s available action space $\mathcal{A}$ . In this view, the agent is not merely a generator of code, but a rational decision-maker that must select the optimal verification strategy for a given state.

From an SDLC perspective, guards can be modeled as a Library of Actions available within specific parent states. For example, in a CodeReview state, the agent might have access to a set of verification actions:

$$\mathcal{A}_{verify}=\{\texttt{run\_syntax\_check},\texttt{run\_unit\_tests},\texttt{run\_security\_scan}\}$$

Each action carries a distinct computational cost and information gain. A simple syntax check is cheap but offers low safety assurance; a security scan is expensive but high-value.
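One way to picture such a library is as a list of actions annotated with cost and assurance. The sketch below is purely illustrative: the numeric costs and assurance values are invented, and the greedy cheapest-first policy under a budget is just one possible meta-policy, not one proposed by this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifyAction:
    name: str
    cost: float       # hypothetical relative compute cost
    assurance: float  # hypothetical information gain in [0, 1]


LIBRARY = [
    VerifyAction("run_security_scan", cost=20.0, assurance=0.9),
    VerifyAction("run_syntax_check", cost=1.0, assurance=0.2),
    VerifyAction("run_unit_tests", cost=5.0, assurance=0.6),
]


def plan_verification(library: list[VerifyAction], budget: float) -> list[VerifyAction]:
    """Greedy sketch: schedule actions cheapest-first while they fit the budget."""
    plan: list[VerifyAction] = []
    spent = 0.0
    for action in sorted(library, key=lambda a: a.cost):
        if spent + action.cost <= budget:
            plan.append(action)
            spent += action.cost
    return plan


tight_plan = [a.name for a in plan_verification(LIBRARY, budget=10.0)]
full_plan = [a.name for a in plan_verification(LIBRARY, budget=30.0)]
```

Cheapest-first ordering mirrors the fail-fast structure of a syntax-then-tests chain; a learned meta-policy could instead trade assurance against cost per state.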

8.4 Standardized Benchmarks for Probabilistic Control

Current code generation benchmarks (e.g., HumanEval, MBPP) primarily measure the static generative capability of models in a zero-shot regime. They do not capture the dynamic capabilities required for agentic workflows: error recovery, state maintenance across retries, and adherence to rigid environmental constraints.

The field requires a Control-Oriented Benchmark Suite—effectively a “GuardGym”—that evaluates agents not on their initial output, but on their ability to converge to a valid state under strict guard feedback. In this paradigm, the primary metrics shift from Pass@k to Refinement Efficiency (the mean number of retries required for convergence) and Trajectory Stability (the resistance to regression loops). Such a benchmark would isolate the architectural contribution of the control loop from the raw knowledge capacity of the model, providing a standardized method for evaluating neuro-symbolic bridges.

8.5 Multi-Agent Shared Truth

In collaborative environments, $\mathcal{S}_{\text{env}}$ is modified by multiple actors. The Dual-State framework provides synchronization without explicit message passing or complex consensus algorithms.

Proposition 5 (Shared Truth via Guards).

If two agents $Ag_{1}$ and $Ag_{2}$ execute the same deterministic Guard $G$ on the same shared artifact $a\in\mathcal{S}_{\text{env}}$ , they arrive at an identical belief regarding the workflow state component $s_{w}$ .

$$B_{Ag_{1}}(s_{w})\cap B_{Ag_{2}}(s_{w})\to\{G(a)\}$$

The workflow state $\mathcal{S}_{\text{workflow}}$ thus serves as a fully observable blackboard. For example, a downstream Implementation Agent does not need to query an upstream Specification Agent for status; it simply executes the relevant (verify-spec) sensing action on the shared artifact. If the Guard passes, the shared truth is established, and execution proceeds.
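A minimal sketch of the proposition: two agents that never exchange messages each run the same deterministic guard on the same artifact and end up with the same belief. The `verify_spec` predicate (a JSON document containing a `task` key) and the `Agent` class are hypothetical, chosen only to make the example runnable.

```python
import json


def verify_spec(artifact: str) -> bool:
    """Hypothetical deterministic guard: the spec is a JSON object with a 'task' key."""
    try:
        spec = json.loads(artifact)
    except json.JSONDecodeError:
        return False
    return isinstance(spec, dict) and "task" in spec


class Agent:
    """An agent whose belief about the workflow state comes only from sensing."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.belief: bool | None = None

    def sense(self, artifact: str) -> bool:
        self.belief = verify_spec(artifact)
        return self.belief


shared_artifact = '{"task": "lru_cache", "capacity": 2}'

spec_agent = Agent("specification")
impl_agent = Agent("implementation")

# No message passing: each agent independently executes the same guard.
spec_agent.sense(shared_artifact)
impl_agent.sense(shared_artifact)
```

Because the guard is a pure function of the artifact, agreement follows by construction; the guarantee would not hold if the guard depended on hidden or agent-local state.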

8.6 Formal Workflow Specification

While this work uses JSON-based task specifications (Appendix B), the Dual-State architecture is compatible with richer formalisms. Future work may extend the specification language to support HTN-style hierarchical decomposition with explicit parallel fork-join semantics (e.g., :ordering (add || tdd || bdd)) and typed generative actions with retry bounds. Such extensions would enable formal verification of workflow properties (deadlock freedom, guaranteed termination) prior to execution.

9 Broader Impact

9.1 Safety as a Systemic Property

A prevailing view in AI alignment seeks to make the generative model itself “categorically safe” through Reinforcement Learning from Human Feedback (RLHF) or constitutional training. However, this work proceeds from the premise that the stochastic nature of Large Language Models is not a defect to be eliminated, but a fundamental capability—a “superpower” that enables creativity and solution diversity. Attempts to constrain this stochasticity at the model weights level risk lobotomizing the very capability we seek to exploit.

Instead, this framework advocates for shifting the locus of safety from the component (the LLM) to the system (the Architecture). By accepting that the solution space of a generative model is inherently probabilistic and unsafe, we can focus on augmenting it with a deterministic control layer that enforces safety constraints. In this view, safety is not an intrinsic attribute of the intelligence, but an emergent property of the workflow in which that intelligence is embedded.

9.2 Auditability

By formalizing the Environment State as a Versioned Repository ( $\mathcal{R}$ ), the framework creates a record of rejected artifacts ( $a_{fail}$ ) and the feedback ( $\phi$ ) that guided correction. This supports post-hoc analysis of failure modes and convergence behavior.

The append-only structure also provides a degree of tamper-evidence: unauthorized insertions into the context would create discontinuities in the derivation graph that could be flagged by external auditors.
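One common way to obtain this kind of tamper-evidence is a hash chain, in which every entry commits to the hash of its predecessor. The sketch below uses that technique as an assumption; the paper does not specify how the derivation graph is represented, and the function names and entry layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry's predecessor


def entry_hash(prev_hash: str, payload: dict) -> str:
    """SHA-256 over the predecessor hash and a canonical JSON payload."""
    blob = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def append(log: list[dict], payload: dict) -> None:
    """Append an entry that commits to the current head of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def verify(log: list[dict]) -> bool:
    """Check that every entry links to its predecessor and matches its own hash."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log: list[dict] = []
append(log, {"artifact": "draft_1", "guard": "fail", "feedback": "SyntaxError"})
append(log, {"artifact": "draft_2", "guard": "pass", "feedback": ""})
intact = verify(log)

# An unauthorized insertion in the middle breaks the link to the next entry.
tampered = list(log)
forged_payload = {"artifact": "injected", "guard": "pass", "feedback": ""}
tampered.insert(1, {
    "prev": log[0]["hash"],
    "payload": forged_payload,
    "hash": entry_hash(log[0]["hash"], forged_payload),
})
tampered_ok = verify(tampered)
```

Even though the forged entry is internally consistent, the entry after it still points at the original predecessor, so verification fails at that discontinuity.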

10 Conclusion

This paper formalizes a Dual-State Framework, an architecture that separates deterministic control flow from stochastic content generation in LLM-based systems. The central mechanism is the Atomic Action Pair, which couples generation with verification as an indivisible transaction. Guard functions serve not merely as filters, but as sensing actions that project opaque generative outputs onto an observable workflow state.

This enables Context Refinement, where guard feedback is incorporated into subsequent generation attempts. Through Guard Functions, Bounded Indeterminacy is achieved—the architecture does not eliminate the generator’s stochastic nature, but confines exploration within logical safety bounds. Additionally, because verification occurs immediately after each generation attempt, the architecture naturally produces immediate, attributable feedback—a property that may support future integration with reinforcement learning or fine-tuning approaches.

Experimental validation across 13 models indicates that the framework can substantially improve reliability for qualified instruction-following models, with observed gains of up to 66 percentage points. However, the results also highlight that guards cannot compensate for fundamental reasoning deficits; the model must possess sufficient capability to utilize verification feedback.

This work suggests a shift in how we view AI safety: not as an intrinsic property of the model weights to be trained in, but as a systemic property of the workflow to be architected. By treating the LLM’s stochasticity as a creative “superpower” rather than a defect, and wrapping it in the deterministic scaffolding of formal verification, we can build systems that are both imaginative and reliable.

As such, the framework’s primary contribution is not algorithmic novelty but a formal grounding: providing vocabulary, convergence conditions, and reasoning principles for architectural patterns already emerging in production systems—extending the tradition of deterministic frameworks for managing unpredictable processes to LLM-based code generation.

11 Acknowledgments

I would like to thank Mark Burgess (https://www.linkedin.com/in/markburgessoslo) and Ray Myers (https://www.linkedin.com/in/cadrlife/) for their guidance on formalization and pointing me in sensible directions, and Joanna Bryson (https://www.linkedin.com/in/bryson/) for openly sharing her raw insights on AI ethics. I am also grateful to Professor Jeremy Scerri (https://www.linkedin.com/in/jeremy-scerri-b1b7b713/) and Jessica Sciammarelli (https://www.linkedin.com/in/jessicasciammarelli/) for their support and for opening critical pathways in my broader learning.

Special thanks to Lio (https://www.linkedin.com/in/lionelcrescence/) for providing the hardware resources, and to the Cohere Labs community (https://cohere.com/research/open-science), led by Madeline (https://www.linkedin.com/in/madeline-smith-3a0b8b155/), for providing a welcoming and energizing environment for this research.

I am also grateful to the software engineering community leaders who have championed the practices central to this work: Bryan Finster (https://www.linkedin.com/in/bryan-finster/), Tracy Bannon (https://www.linkedin.com/in/tracylbannon/), Patrick Debois (https://www.linkedin.com/in/patrickdebois/), Matthew Skelton (https://www.linkedin.com/in/matthewskelton/), and Rob Bowley (https://www.linkedin.com/in/robertbowley/).

Finally, thank you to Erwan Keraudy (https://www.linkedin.com/in/erwankeraudy/) and David Neil (https://www.linkedin.com/in/david-neil-44a67217b/) for being there at the start, to my business partner, Hugo Miralles (https://www.linkedin.com/in/hugo-miralles/), and to my peers for their feedback and support: Oli, Natalia, Olgo, and Philip.

"There is one more thing, it’s been emotional."

References

[1] M. Burgess, “A site configuration engine,” Computing Systems, vol. 8, no. 2, pp. 309–337, 1995. MIT Press, Cambridge, MA.
[2] ——, “Computer immunology,” Proceedings of the 12th Systems Administration Conference (LISA ’98), pp. 283–297, 1998. [Online]. Available: https://markburgess.org/papers/immune.pdf
[3] ——, “An approach to understanding policy based on autonomy and promises,” in Proceedings of the 16th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM). Springer, 2005, pp. 174–187.
[4] ——, Promise Theory: Principles and Applications. Xtaxis Press, 2014.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.
[6] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020.
[7] K. Erol, J. Hendler, and D. S. Nau, “HTN planning: Complexity and expressivity,” in Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), vol. 94, 1994, pp. 1123–1128.
[8] S. Kambhampati, K. Valmeekam, L. Guan, M. Verma, K. Stechly, S. Bhambri, L. Saldyt, and A. Murthy, “LLMs can’t plan, but can help planning in LLM-Modulo frameworks,” arXiv preprint arXiv:2402.01817, 2024. [Online]. Available: https://arxiv.org/abs/2402.01817
[9] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2022.
[10] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022.
[11] Z. Shen, R. Gay, and X. Tao, “Goal-based intelligent agents,” International Journal of Information Technology, vol. 9, 2014.
[12] M. C. Feathers, Working Effectively with Legacy Code. Prentice Hall, 2004. Introduces characterization testing as a technique for understanding legacy systems.

Appendix A Structural Resolution of the Credit Assignment Problem

A fundamental challenge in training and orchestrating agentic systems with Reinforcement Learning is the Credit Assignment Problem (CAP): determining which past action is responsible for the current reward. In standard Chain-of-Thought (CoT) or monolithic generation approaches, the “Time Horizon” between the error (e.g., a hallucinated variable at step $t=2$) and the penalty (e.g., a compilation error at step $t=50$) is large. This makes gradient attribution (or in-context correction) noisy and inefficient, as the agent must infer which specific token in the history caused the failure.

The Atomic Action Pair can serve as a structural solution to the Temporal CAP, effectively converting a sparse-reward search problem into a dense-reward optimization loop.

A.1 Collapsing the Reward Horizon

Standard autoregressive agents operate with a delayed reward horizon ($T\gg 1$), where validation occurs only after the complete generation of a complex artifact. By enforcing that every generative action ($a_{gen}$) is immediately followed by a sensing action ($G$), the Dual-State architecture collapses the reward horizon to $T=1$.

Consequently, the “discount factor” for error correction approaches zero ($\gamma\approx 0$). Every state transition in the Workflow State is gated by an immediate validity signal, ensuring that “blame” for failure is instantly and correctly assigned to the most recent generative attempt. This may explain the efficiency observed in the experiments ($1.2\text{--}2.1\times$ cost), as the agent never expends compute continuing a trajectory that has already diverged from validity.
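The collapsed horizon can be illustrated with a minimal sketch. The names `run_action_pair`, `syntax_guard`, and `stub_generator` are hypothetical and do not come from the reference implementation; the stub generator stands in for an LLM call so that the example runs on its own.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration: a guard returns (passed, feedback).
Guard = Callable[[str], Tuple[bool, str]]
Generator = Callable[[List[str]], str]


def run_action_pair(
    generate: Generator, guard: Guard, max_attempts: int = 3
) -> Tuple[Optional[str], int]:
    """Run generation and verification as one unit with horizon T = 1.

    Every candidate is checked immediately, so a failure is always
    attributed to the most recent generation attempt.
    """
    context: List[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = generate(context)
        passed, feedback = guard(candidate)
        if passed:
            return candidate, attempt
        context.append(feedback)  # feedback refers only to this attempt
    return None, max_attempts


def syntax_guard(code: str) -> Tuple[bool, str]:
    """Deterministic guard: does the candidate compile as Python?"""
    try:
        compile(code, "<candidate>", "exec")
        return True, ""
    except SyntaxError as e:
        return False, f"Line {e.lineno}: {e.msg}"


def stub_generator(context: List[str]) -> str:
    """Stand-in for an LLM: emits broken code until it sees feedback."""
    return "x = 1\n" if context else "x = = 1\n"


artifact, attempts = run_action_pair(stub_generator, syntax_guard)
```

Because the guard runs inside the loop, no later generation step is ever built on top of an unvalidated artifact.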

A.2 The Coach as a Reward Shaping Mechanism

While the Guard ($G$) provides a ground-truth signal, it is inherently sparse ($r\in\{\perp,\top\}$). A sparse signal informs the agent that it failed, but not how to correct the error, potentially leading to random walk behavior during refinement.

The Coach ($a_{coach}$), defined in Section 6.1.1, addresses this by introducing Reward Shaping. It effectively approximates a value function $V(s)$ by analyzing the Guard’s error trace ($\phi$) and projecting a dense semantic signal back into the Context ($C$):

$$C_{t+1} \leftarrow C_{t} \cup \text{Shaping}(a_{coach}(\phi)) \tag{10}$$
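The context update in Eq. (10) can be sketched as a pure function over an immutable set. This is an illustration under stated assumptions: `refine_context` and `rule_based_coach` are hypothetical names, the Shaping step is modeled as the identity, and a small rule table stands in for whatever mechanism actually produces the Coach's output.

```python
from typing import Callable, FrozenSet

Context = FrozenSet[str]
Coach = Callable[[str], str]


def rule_based_coach(error_trace: str) -> str:
    """Map a raw guard error trace (phi) to a denser corrective hint."""
    if "IndexError" in error_trace:
        return "Check that the container is non-empty before indexing or popping."
    if "NameError" in error_trace:
        return "Define or import every name before using it."
    return f"Address the reported failure: {error_trace}"


def refine_context(context: Context, error_trace: str, coach: Coach) -> Context:
    """C_{t+1} <- C_t U Shaping(a_coach(phi)), with Shaping as the identity."""
    return context | {coach(error_trace)}


c0: Context = frozenset()
c1 = refine_context(c0, "IndexError: pop from empty list", rule_based_coach)
c2 = refine_context(c1, "NameError: name 'Stack' is not defined", rule_based_coach)
```

Using a set union keeps the update monotone: hints accumulate across attempts, and repeating the same hint does not grow the context.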

Appendix B Workflow Specification Format

Workflows are specified in a declarative JSON format inspired by PDDL (Planning Domain Definition Language). Each workflow defines a directed acyclic graph (DAG) of action pairs, where edges represent artifact dependencies.

B.1 Schema

Listing 1: Workflow Specification Schema

{
  "version": "1.0",
  "workflows": {
    "<workflow_id>": {
      "name": "<Human-readable name>",
      "specification": "<Natural language description>",
      "action_pairs": {
        "<action_pair_id>": {
          "prompt": "<Generation prompt with {placeholders}>",
          "guard": "<guard_type>",
          "requires": ["<dependency_action_pair_ids>"]
        }
      }
    }
  }
}
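Since the action pairs form a DAG over `requires` edges, a loader can derive an execution order by topological sorting. The following is a minimal sketch, assuming the schema above; `execution_order` and the example workflow contents are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical specification following the schema above.
SPEC = {
    "version": "1.0",
    "workflows": {
        "tdd_stack": {
            "name": "Stack",
            "specification": "Implement a Stack class.",
            "action_pairs": {
                "g_impl": {
                    "prompt": "Implement code passing:\n{test_code}",
                    "guard": "dynamic_test",
                    "requires": ["g_test"],
                },
                "g_test": {"prompt": "Write pytest tests.", "guard": "syntax"},
            },
        }
    },
}


def execution_order(workflow: dict) -> list:
    """Return action pair ids so that every dependency precedes its dependents."""
    pairs = workflow["action_pairs"]
    graph = {}
    for pair_id, pair in pairs.items():
        requires = pair.get("requires", [])
        unknown = [dep for dep in requires if dep not in pairs]
        if unknown:
            raise ValueError(f"{pair_id} requires unknown action pairs: {unknown}")
        graph[pair_id] = set(requires)
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(SPEC["workflows"]["tdd_stack"])
```

Rejecting unknown dependencies at load time keeps specification errors out of the stochastic part of the system.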

B.2 Guard Types

• syntax: AST parsing validation (G₈)
• dynamic_test: Runtime test execution (G₁₀)
• type: Static type checking via mypy (G₉)
• architecture: Layer boundary validation (G₁₁)
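One way to connect the `guard` strings in a workflow file to executable checks is a small registry. This is a sketch only: `GUARD_REGISTRY`, `register`, and `resolve_guard` are hypothetical names, and only the `syntax` type is registered here so that the example has no third-party dependencies.

```python
import ast
from typing import Callable, Dict, Tuple

GuardFn = Callable[[str], Tuple[bool, str]]

# Hypothetical registry mapping "guard" strings to callables.
GUARD_REGISTRY: Dict[str, GuardFn] = {}


def register(guard_type: str) -> Callable[[GuardFn], GuardFn]:
    """Decorator that records a guard function under its type name."""
    def decorator(fn: GuardFn) -> GuardFn:
        GUARD_REGISTRY[guard_type] = fn
        return fn
    return decorator


@register("syntax")
def syntax_guard(code: str) -> Tuple[bool, str]:
    """AST parsing validation, as described for the syntax guard type."""
    try:
        ast.parse(code)
        return True, ""
    except SyntaxError as e:
        return False, f"Line {e.lineno}: {e.msg}"


def resolve_guard(guard_type: str) -> GuardFn:
    """Look up a guard by type, failing loudly on unknown names."""
    try:
        return GUARD_REGISTRY[guard_type]
    except KeyError:
        known = ", ".join(sorted(GUARD_REGISTRY)) or "(none)"
        raise ValueError(
            f"Unknown guard type {guard_type!r}; registered: {known}"
        ) from None
```

An unknown guard type is treated as a configuration error rather than a silent pass, which keeps the workflow state deterministic.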

B.3 Example: TDD Stack Task

Listing 2: Stack Implementation Task

"tdd_stack": {
  "name": "Stack",
  "specification": "Implement a Stack class with push, pop, peek, is_empty, and size methods.",
  "action_pairs": {
    "g_test": {
      "prompt": "Write pytest test functions for a Stack class...\nOutput ONLY the test code.",
      "guard": "syntax"
    },
    "g_impl": {
      "prompt": "Write a Python Stack class...\n You must implement code that passes the following tests:\n{test_code}",
      "guard": "dynamic_test",
      "requires": ["g_test"]
    }
  }
}

The requires field creates an artifact dependency: g_impl receives the validated output of g_test via the {test_code} placeholder. This enables Test-Driven Development workflows where tests are generated first, then used to validate implementations.
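The placeholder injection can be sketched as a single-pass substitution. The function name `render_prompt` and the artifact dictionary are assumptions made for illustration; the framework's actual templating mechanism is not specified in this appendix.

```python
import re
from typing import Dict


def render_prompt(template: str, artifacts: Dict[str, str]) -> str:
    """Replace each {name} placeholder with the validated artifact of that name.

    A single regex pass is used so that braces inside injected code are
    never re-interpreted as placeholders.
    """
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in artifacts:
            raise KeyError(f"No validated artifact for placeholder {{{name}}}")
        return artifacts[name]

    return re.sub(r"\{(\w+)\}", substitute, template)


# Hypothetical validated output of g_test; it contains braces of its own.
validated = {"test_code": "def test_lookup():\n    assert {1: 2}[1] == 2"}
prompt = render_prompt(
    "You must implement code that passes the following tests:\n{test_code}",
    validated,
)
```

Raising on a missing artifact means an action pair cannot be prompted before the action pairs it requires have been validated.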

Appendix C Guard Function Catalog

This appendix enumerates the deterministic guard functions ($\mathcal{G}$) that enforce correctness constraints. Each guard validates a specific state transition.

Remark 7 (Human Oversight for Semantic Guards).

Certain guards validate artifacts that are semantically complex—where correctness cannot be verified by deterministic checks alone. These include:

• Domain models (G₁): Whether entities and invariants correctly capture business requirements
• Generated test specifications (G₄, G₆): Whether BDD scenarios or architecture tests capture intended constraints

When the LLM generates artifacts that themselves become validators (guards generating guards), a HumanGuard checkpoint is essential. Without human meta-validation, errors in generated specifications propagate silently through the entire validation chain.

Guards marked with † are recommended HumanGuard integration points.

Phase Overview

The guard catalog comprises 29 guards: 21 across 9 workflow phases, 2 composite guards, and 6 bootstrapping guards for legacy systems (Appendix F):

Phase	Name	Guards	Primary Concern
1	Architecture Definition	G₁†–G₄†	Domain model & structure
2	Test Definition	G₅–G₆†	Unit tests & BDD scenarios
3	Implementation	G₇–G₁₀	Code generation & validation
4	Architectural Compliance	G₁₁–G₁₃	Layer boundaries & DI
5	Behavioral Validation	G₁₄–G₁₅	Acceptance & quality gates
6	Operational Safety	G₁₆–G₁₇	Execution & file safety
7	Structure Audit	G₁₈–G₁₉	Documentation sync
8	Version Control	G₂₃	Pre-commit validation
9	Human Oversight	G₂₀	Final approval checkpoint
–	Composite	G₂₁–G₂₂	Guard composition patterns
–	Bootstrap (Legacy)	G₂₄–G₂₉	Brownfield system support

Table 8: Workflow phases and associated guards. † = HumanGuard integration recommended.

Phases 1–2 establish what to build, phases 3–5 ensure correctness, and phases 6–8 enforce safety.

Phase 1: Architecture Definition

Phase 2: Test Definition

Phase 3: Implementation

Phase 4: Architectural Compliance

Phase 5: Behavioral Validation

Phase 6: Operational Safety

Phase 7: Structure Audit

Phase 8: Version Control Safety

Phase 9: Human Oversight

Composite Guards

Appendix D Guard Library Implementation

Reference implementations in Python. All guards implement the GuardInterface:

Listing 3: Abstract Guard Interface

from abc import ABC, abstractmethod

class GuardInterface(ABC):
    @abstractmethod
    def validate(self, artifact: Artifact, **deps) -> GuardResult:
        """Returns (passed: bool, feedback: str)"""
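The listings in this appendix use `Artifact` and `GuardResult` without defining them. A minimal sketch of the shapes they appear to assume, inferred only from how the listings use them (`artifact.content`, `GuardResult(passed=..., feedback=...)`, and the positional form `GuardResult(False, "...")`), might look as follows; the real classes may carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Assumed shape: the listings only ever read `content`."""
    content: str


@dataclass(frozen=True)
class GuardResult:
    """Assumed shape: a pass/fail flag plus optional feedback text."""
    passed: bool
    feedback: str = ""


ok = GuardResult(passed=True)
rejected = GuardResult(False, "Syntax error: invalid syntax")
```

Freezing both dataclasses reflects the idea that a guard observes an artifact and reports on it without mutating either.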

D.1 SyntaxGuard (G₈)

Listing 4: Syntax Validation Logic

import ast

class SyntaxGuard(GuardInterface):
    def validate(self, artifact, **deps):
        try:
            ast.parse(artifact.content)
            return GuardResult(passed=True)
        except SyntaxError as e:
            return GuardResult(
                passed=False,
                feedback=f"Line {e.lineno}: {e.msg}"
            )

D.2 DynamicTestGuard (G₁₀)

Executes generated tests against generated implementation:

Listing 5: Dynamic Test Execution

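class DynamicTestGuard(GuardInterface):
    def validate(self, artifact, **deps):
        test_artifact = deps.get('test')
        if test_artifact is None:
            return GuardResult(passed=False, feedback="No test artifact supplied")
        namespace = {}
        try:
            exec(artifact.content, namespace)       # define the implementation
            exec(test_artifact.content, namespace)  # define the test functions
            # Defining pytest-style tests does not run them: call each test_* function
            for name, fn in list(namespace.items()):
                if name.startswith("test_") and callable(fn):
                    fn()
            return GuardResult(passed=True)
        except AssertionError as e:
            return GuardResult(
                passed=False,
                feedback=f"Test failed: {e}"
            )
        except Exception as e:  # e.g. IndexError or NameError raised by the code under test
            return GuardResult(passed=False, feedback=f"{type(e).__name__}: {e}")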

D.3 ArchitectureBoundaryGuard (G₁₁)

Validates Clean Architecture dependency rule:

Listing 6: Dependency Analysis

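import ast

class ArchitectureBoundaryGuard(GuardInterface):
    FORBIDDEN = {'boto3', 'sqlalchemy', 'requests', 'django'}

    def validate(self, artifact, **deps):
        tree = ast.parse(artifact.content)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom):
                # Relative imports (level > 0) refer to local modules, not infrastructure
                modules = [node.module] if node.module and node.level == 0 else []
            else:
                continue
            for module in modules:
                # Compare the top-level package so 'import boto3.session' is caught too
                if module.split('.')[0] in self.FORBIDDEN:
                    return GuardResult(
                        passed=False,
                        feedback=f"Domain imports infrastructure: {module}"
                    )
        return GuardResult(passed=True)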

D.4 HumanGuard (G₂₀)

Pauses workflow for human approval:

Listing 7: Human-in-the-Loop Validation

class HumanGuard(GuardInterface):
    """Prompts human to verify artifact."""

    def __init__(self, prompt: str = "Approve this artifact?"):
        self.prompt = prompt

    def validate(self, artifact: Artifact, **deps) -> GuardResult:
        print(f"\n{'='*20} HUMAN REVIEW {'='*20}")
        preview = artifact.content[:500]
        if len(artifact.content) > 500:
            preview += "..."
        print(f"\n{preview}")
        print(f"\n{self.prompt} [y/n/feedback]: ", end="")

        # Keep the original casing so free-text feedback is passed on unchanged
        response = input().strip()

        if response.lower() == 'y':
            return GuardResult(passed=True)
        elif response.lower() == 'n':
            return GuardResult(
                passed=False,
                feedback="Human rejected artifact"
            )
        else:
            # Treat as feedback for refinement
            return GuardResult(
                passed=False,
                feedback=f"Human feedback: {response}"
            )

D.5 CompositeGuard (G₂₁)

Combines multiple guards with AND semantics:

Listing 8: Composite Logic

from typing import List

class CompositeGuard(GuardInterface):
    def __init__(self, guards: List[GuardInterface]):
        self._guards = guards

    def validate(self, artifact, **deps):
        for guard in self._guards:
            result = guard.validate(artifact, **deps)
            if not result.passed:
                return result  # Fail fast
        return GuardResult(passed=True)

D.6 PreCommitGuard (G₂₃)

Validates staged changes before allowing a commit:

Listing 9: Pre-Commit Validation

import re

# check_formatting() and run_linter() are helper functions not shown in this listing.

class PreCommitGuard(GuardInterface):
    """Enforces code quality gates before version control commits."""

    SECRET_PATTERNS = [
        r"(?i)(api_key|secret|password|token)\s*=\s*['\"][^'\"]+['\"]",
        r"(?i)Bearer\s+[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_]+",
    ]

    def validate(self, artifact, **deps):
        # 1. Security: Check for hardcoded secrets
        for pattern in self.SECRET_PATTERNS:
            if re.search(pattern, artifact.content):
                return GuardResult(
                    passed=False,
                    feedback="Security Risk: Potential hardcoded secret detected."
                )

        # 2. Style: Check formatting compliance (e.g., Black)
        try:
            if not check_formatting(artifact.content):
                return GuardResult(
                    passed=False,
                    feedback="Style Error: Code is not formatted (run 'black')."
                )
        except Exception as e:
            return GuardResult(
                passed=False,
                feedback=f"Formatting check failed: {e}"
            )

        # 3. Quality: Run linter (e.g., Ruff, Flake8)
        lint_errors = run_linter(artifact.content)
        if lint_errors:
            return GuardResult(
                passed=False,
                feedback=f"Linting failed:\n{lint_errors}"
            )

        return GuardResult(passed=True)
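To make the secret check concrete, the two patterns from the listing above can be exercised on a small sample. The helper `find_secret_lines` and the sample source (including its fake credentials) are hypothetical. Patterns of this kind are heuristics and may miss secrets or flag harmless strings.

```python
import re
from typing import List

# The two patterns from the PreCommitGuard listing.
SECRET_PATTERNS = [
    r"(?i)(api_key|secret|password|token)\s*=\s*['\"][^'\"]+['\"]",
    r"(?i)Bearer\s+[A-Za-z0-9\-_]+\.[A-Za-z0-9\-_]+",
]


def find_secret_lines(source: str) -> List[int]:
    """Return the 1-based numbers of lines matching any secret pattern."""
    flagged = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(re.search(pattern, line) for pattern in SECRET_PATTERNS):
            flagged.append(number)
    return flagged


# Fake credentials, for illustration only.
SAMPLE = '''import os
API_KEY = "sk-12345"
token = os.environ["TOKEN"]
headers = {"Authorization": "Bearer eyJhbGci.eyJzdWIi"}
'''

flagged = find_secret_lines(SAMPLE)
```

The third line is not flagged because the value is read from the environment rather than assigned from a string literal.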

D.7 ParallelGuard

Independent guards can execute concurrently:

Listing 10: Parallel Execution

from concurrent.futures import ThreadPoolExecutor, as_completed

class ParallelGuard(GuardInterface):
    def __init__(self, guards, max_workers=4):
        self._guards = guards
        self._max_workers = max_workers

    def validate(self, artifact, **deps):
        failures = []
        with ThreadPoolExecutor(max_workers=self._max_workers) as executor:
            futures = [
                # Forward deps so guards such as DynamicTestGuard receive their inputs
                executor.submit(g.validate, artifact, **deps)
                for g in self._guards
            ]
            for future in as_completed(futures):
                result = future.result()
                if not result.passed:
                    failures.append(result.feedback)
        if failures:
            return GuardResult(
                passed=False,
                feedback="\n---\n".join(failures)
            )
        return GuardResult(passed=True)
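The difference between fail-fast composition (Listing 8) and parallel composition (Listing 10) lies in how much feedback one attempt yields. The sketch below contrasts the two using plain functions. All names are hypothetical, and `executor.map` is used instead of `as_completed` so that the feedback order is deterministic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Check = Callable[[str], Tuple[bool, str]]


def no_tabs(code: str) -> Tuple[bool, str]:
    return "\t" not in code, "Tabs found"


def short_lines(code: str) -> Tuple[bool, str]:
    ok = all(len(line) <= 79 for line in code.splitlines())
    return ok, "Line longer than 79 characters"


def fail_fast(checks: List[Check], code: str) -> List[str]:
    """AND semantics, stopping at the first failure."""
    for check in checks:
        passed, feedback = check(code)
        if not passed:
            return [feedback]
    return []


def collect_all(checks: List[Check], code: str, max_workers: int = 4) -> List[str]:
    """AND semantics, running every check and gathering all failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(lambda check: check(code), checks))
    return [feedback for passed, feedback in results if not passed]


bad_code = "\tx = 1\ny = " + "1" * 100
first_only = fail_fast([no_tabs, short_lines], bad_code)
everything = collect_all([no_tabs, short_lines], bad_code)
```

Both variants reject the artifact, but the parallel form returns every failure, which can reduce the number of refinement rounds at the price of running every guard.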

D.8 Bootstrap Guards

The following guards support bootstrapping legacy systems (see Appendix F).

Listing 11: Static Analysis Guard (G₂₄)

import ast

class StaticAnalysisGuard(GuardInterface):
    """G24: Validates legacy code is parseable and analyzable."""

    def validate(self, artifact: Artifact, **deps) -> GuardResult:
        code = artifact.content
        errors = []

        # Check AST parseability
        try:
            tree = ast.parse(code)
        except SyntaxError as e:
            return GuardResult(passed=False, feedback=f"Syntax error: {e}")

        # Check imports are resolvable (_can_resolve is a helper not shown in this listing)
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if not self._can_resolve(alias.name):
                        errors.append(f"Unresolvable: {alias.name}")
            elif isinstance(node, ast.ImportFrom):
                if node.module and not self._can_resolve(node.module):
                    errors.append(f"Unresolvable: {node.module}")

        if errors:
            return GuardResult(passed=False, feedback="\n".join(errors))
        return GuardResult(passed=True)
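The listing above calls `_can_resolve` without showing it. One plausible way to implement such a check, offered here as an assumption rather than the reference implementation, is `importlib.util.find_spec`, which locates a module without executing it (though for dotted names it does import the parent package).

```python
import importlib.util


def can_resolve(module_name: str) -> bool:
    """Return True if the import system can locate `module_name`.

    find_spec returns None for a missing top-level module and raises
    ModuleNotFoundError (an ImportError) when a parent package is missing,
    so both cases are mapped to False.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


resolvable = can_resolve("json")
missing = can_resolve("definitely_missing_pkg_9f3a")
missing_parent = can_resolve("definitely_missing_pkg_9f3a.sub")
```

The result depends on the interpreter environment in which the guard runs, so the check should execute in the same environment as the legacy code it audits.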

Listing 12: Characterization Guard (G₂₆)

import subprocess
import tempfile
from pathlib import Path

class CharacterizationGuard(GuardInterface):
    """G26: Tests must pass against legacy code (oracle inversion)."""

    def validate(self, artifact: Artifact, **deps) -> GuardResult:
        test_code = artifact.content
        legacy_code = deps["legacy_artifact"].content

        with tempfile.TemporaryDirectory() as tmpdir:
            impl_path = Path(tmpdir) / "implementation.py"
            test_path = Path(tmpdir) / "test_impl.py"
            impl_path.write_text(legacy_code)
            test_path.write_text(test_code)

            # Tests MUST pass - legacy code is the oracle
            try:
                result = subprocess.run(
                    ["pytest", str(test_path), "-v"],
                    cwd=tmpdir, capture_output=True, timeout=30
                )
            except subprocess.TimeoutExpired:
                return GuardResult(passed=False, feedback="Tests timed out after 30s")

            if result.returncode != 0:
                # In bootstrapping, test failure = TEST is wrong
                return GuardResult(
                    passed=False,
                    feedback="Tests failed against legacy code. "
                             "The TEST is incorrect.\n"
                             f"{result.stdout.decode()}"
                )
        return GuardResult(passed=True)

Listing 13: Coverage Guard (G₂₇)

import json
import subprocess
import tempfile
from pathlib import Path

class CoverageGuard(GuardInterface):
    """G27: Validates test coverage meets thresholds."""

    def __init__(self, line_threshold=0.8, branch_threshold=0.7):
        self.line_threshold = line_threshold
        self.branch_threshold = branch_threshold  # not enforced in this listing

    def validate(self, artifact: Artifact, **deps) -> GuardResult:
        with tempfile.TemporaryDirectory() as tmpdir:
            # Assumption: same file layout and dependency key as CharacterizationGuard (G26)
            (Path(tmpdir) / "implementation.py").write_text(deps["legacy_artifact"].content)
            (Path(tmpdir) / "test_impl.py").write_text(artifact.content)

            # Run tests with coverage measurement (requires pytest-cov)
            subprocess.run(
                ["pytest", "--cov=implementation",
                 "--cov-report=json"],
                cwd=tmpdir, capture_output=True
            )

            cov_file = Path(tmpdir) / "coverage.json"
            if not cov_file.exists():
                return GuardResult(passed=False, feedback="No coverage report produced")
            cov_data = json.loads(cov_file.read_text())
            line_cov = cov_data["totals"]["percent_covered"] / 100

            if line_cov < self.line_threshold:
                return GuardResult(
                    passed=False,
                    feedback=f"Coverage {line_cov:.1%} < "
                             f"threshold {self.line_threshold:.1%}"
                )
        return GuardResult(passed=True)

Appendix E TDD Workflow Execution Trace

This appendix illustrates a complete TDD workflow execution for the Stack task.

E.1 Workflow DAG

E.2 Step 1: Test Generation (g_test)

Prompt:

Listing 14: Test Generation Prompt

Write pytest test functions for a Stack class
with push, pop, peek, is_empty, and size methods.
Output ONLY the test code.

Generation Attempt 1:

Listing 15: Generated Tests

def test_push_pop():
    s = Stack()
    s.push(1)
    assert s.pop() == 1

def test_is_empty():
    s = Stack()
    assert s.is_empty()
    s.push(1)
    assert not s.is_empty()

Guard (SyntaxGuard): $\top$ (AST parses successfully)

State Transition: g_test $\rightarrow$ VALIDATED

E.3 Step 2: Implementation Generation (g_impl)

Prompt (with artifact injection):

Listing 16: Implementation Prompt

Write a Python Stack class...
You must implement code that passes:
{test_code}  # <- Injected from g_test

Generation Attempt 1:

Listing 17: First Draft Implementation

class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()  # Bug: no empty check

Guard (DynamicTestGuard): $\bot$

Context Refinement:

Listing 18: Refinement Prompt

Previous attempt failed:
IndexError: pop from empty list
Fix the implementation.

Generation Attempt 2:

Listing 19: Corrected Implementation

class Stack:
    def __init__(self):
        self._items = []

    def push(self, item):
        self._items.append(item)

    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

    # ... (other methods omitted for brevity)

Guard (DynamicTestGuard): $\top$ (all tests pass)

State Transition: g_impl $\rightarrow$ VALIDATED

E.4 Execution Summary

Step	Attempts	Guard Result	Duration
g_test	1	$\top$	2.3s
g_impl	2	$\top$	4.1s

Total retries: 1
Total duration: 6.4s
Workflow state: COMPLETE

Appendix F Bootstrapping Legacy Systems

The framework as presented assumes workflows begin with explicit specifications. Legacy (“brownfield”) systems present a practical challenge: the specification exists only implicitly in running code. There are no tests to validate against, no documented invariants to enforce.

This appendix sketches how the framework might be extended to support bootstrapping—generating the missing validation infrastructure from existing codebases. The approach is not novel; it formalizes characterization testing practices that predate this work [12].

F.1 The Initialization Problem

In greenfield systems, guards have well-defined pass/fail semantics from the start. In brownfield systems, we face a chicken-and-egg problem: we cannot validate code without tests, but we cannot write tests without understanding the code’s actual behavior.

Formally, for a legacy artifact $a_{legacy}$ and guard $G_{i}$, the initial state is undefined: we lack the predicate to evaluate. The bootstrapping problem is to construct that predicate by observing the system’s behavior.

F.2 Characterization Testing as Guard Generation

The standard TDD relationship (code must satisfy tests) reverses during bootstrapping: tests must satisfy code. The legacy system becomes the oracle.

Approach	Oracle	Artifact Under Test
Standard TDD	Tests	Code must satisfy tests
Bootstrapping	Code	Tests must satisfy code

This has a practical consequence: test failure during bootstrapping indicates a bug in the test, not the system under test. The guard accepts tests only when they pass against unmodified legacy code—including behavior that might be considered bugs in a greenfield context but are now load-bearing “features.”

F.3 A Three-Phase Pipeline

The bootstrapping process can be decomposed into three phases. These are not novel contributions—they reflect standard practice in legacy system modernization—but expressing them as guard predicates allows integration with the framework.

F.3.1 Phase I: Structural Audit

The first phase establishes what can be analyzed without execution.

Output: Dependency graph, module boundaries, entry points—the structural map needed for targeted characterization.

F.3.2 Phase II: Characterization Testing

Characterization tests capture what the system does, not what it should do. The legacy code is the oracle.

Remark 8 (Untestable Code).

Code that resists characterization testing often indicates dead code, error handlers for impossible conditions, or race conditions. These require manual analysis via HumanGuard (G₂₀) rather than automated characterization.

F.3.3 Phase III: Constraint Promotion

Characterization tests become constraints. The system transitions from “undefined state” to “guarded state.”

Once G₂₉ passes, the legacy system has bootstrapped into a standard guarded workflow—future changes must satisfy the characterization tests. The system transitions from “we don’t know what this code does” to “we have tests that document what it does, and changes must preserve that behavior.”

This is not a guarantee of correctness in any absolute sense. The characterization tests capture observed behavior, which may include bugs. The value is that changes are now guarded—regressions become detectable.
