Managing the Stochastic: Foundations of Learning in Neuro-Symbolic Systems for Software Engineering
Matthew Thompson
ORCID: 0009-0007-0846-0369
Independent Researcher
December 18, 2025
Preprint submitted to arXiv.
© 2025 Matthew Thompson. This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Code available under MIT License.
Abstract
Current approaches to AI coding agents appear to blur the lines between the Large Language Model (LLM) and the agent itself, asking the LLM to make decisions best left to deterministic processes. This leads to systems prone to stochastic failures such as gaming unit tests or hallucinating syntax. Drawing on established software engineering practices that provide deterministic frameworks for managing unpredictable processes, this paper proposes setting the control boundary such that the LLM is treated as a component of the environment, preserving its creative stochasticity, rather than as the decision-making agent.
A Dual-State Architecture is formalized, separating workflow state (deterministic control flow) from environment state (stochastic generation). Atomic Action Pairs couple generation with verification as indivisible transactions, where Guard Functions act as sensing actions that project probabilistic outputs onto observable workflow state.
The framework is validated on three code generation tasks across 13 LLMs (1.3B–15B parameters). For qualified instruction-following models, task success rates improved by up to 66 percentage points at 1.2–2.1× baseline computational cost. The results suggest that architectural constraints can substitute for parameter scale in achieving reliable code generation.
Keywords: Neuro-symbolic AI, LLM Agents, Runtime Verification, Code Generation, Iterative Refinement, Software Engineering
Contents
- 1 Introduction
- 2 Definitions
- 3 Formal Framework
- 4 Theoretical Results
- 5 Complexity Analysis
- 6 Experimental Validation
- 7 Limitations
- 8 Future Research
- 9 Broader Impact
- 10 Conclusion
- 11 Acknowledgments
- A Structural Resolution of the Credit Assignment Problem
- B Workflow Specification Format
- C Guard Function Catalog
- D Guard Library Implementation
- E TDD Workflow Execution Trace
- F Bootstrapping Legacy Systems
1 Introduction
1.1 Motivation: A Formal Building Block for Software Engineering
Modern software engineering has evolved sophisticated practices for managing complexity and uncertainty: CFEngine for infrastructure convergence, XP for iterative development, CI/CD for continuous validation, and Domain-Driven Design for semantic boundaries. These practices share a common thread—they provide deterministic frameworks for managing inherently unpredictable processes.
Large Language Models present a parallel challenge: substantial capability with stochastic behavior. While "Attention is All You Need" launched the transformer revolution, experience suggests that Attention is Not All You Need for Software Engineering—robust systems also require the scaffolding of verification, convergence, and bounded autonomy. Just as CFEngine treats infrastructure as eventually consistent rather than immediately correct, LLM outputs are treated in this framework as eventually valid through iterative refinement.
This work does not propose a revolutionary new algorithm, but rather a formalization of these emerging architectural patterns. The contribution is a theoretical grounding of these heuristics—providing the vocabulary, convergence guarantees, and formal reasoning framework required to transform ad-hoc "guardrails" into rigorous engineering disciplines.
Specifically, this work formalizes the separation of deterministic control flow from stochastic content generation. Through Atomic Action Pairs (inseparable generation-verification units) and a Dual-State Solution Space (workflow state versus environment state), the framework enables LLMs to operate within traditional software engineering bounds. Each verification failure provides feedback that refines subsequent generation attempts, achieving reliability through iteration rather than perfection.
Additionally, the architecture provides a potential approach to the Credit Assignment Problem inherent in LLM training. By enforcing immediate verification, the framework naturally generates reward signals attributed to specific generation attempts, rather than the sparse rewards typical of end-to-end generation. These verified traces can support both online reinforcement learning and offline supervised fine-tuning (e.g., via LoRA). Over time, this could theoretically reduce retry rates from the observed 2.1× toward 1.0× as models internalize domain-specific constraints.
Importantly, this architectural approach enables reliable systems using smaller, locally-deployed models (< 15B parameters) rather than requiring API access to frontier models. By substituting architectural rigor for parameter scale, organizations can maintain control over their software development pipeline while achieving high reliability.
1.2 Theoretical Context and Prior Art
The control dynamics of this framework draw from convergence patterns first encountered in configuration management, where CFEngine pioneered the approach in applied research [1, 2] before it was formalized as Promise Theory [3, 4]. In production systems, CFEngine demonstrated that reliability emerges not from commanding distributed components but from continuous convergence toward desired states: autonomous agents make promises rather than receive commands. This practical insight, later abstracted by Burgess into Promise Theory, provides the theoretical lens for managing stochastic systems: treat unreliable components as "promisers" and architect convergence operators to guide them toward validity.
Applied to LLMs, this convergence paradigm operates within the classical agent-environment boundary defined by Sutton & Barto [5], where the agent comprises only components modifiable by the control policy. Since the LLM’s weights cannot be modified during inference, it resides in the environment, with the agent function—per Russell & Norvig’s formalization [6]—mapping percepts (generation outputs) to actions (verification decisions).
Against this theoretical foundation, prior approaches to LLM control generally fall into two categories:
- External Control Architectures (Symbolic & Hybrid):
- Internal Control Architectures (LLM-Centric): Conversely, techniques like Chain-of-Thought [9] and ReAct [10] locate the control loop inside the stochastic generation window. While flexible, these methods suffer from probabilistic control flow, where the agent’s decision-making process is subject to the same hallucination modes as its content generation.
This framework synthesizes these perspectives by utilizing external symbolic guards to enforce the internal convergence of generative promises.
1.3 Framework Overview
Building on goal-based agent frameworks [11], this work introduces a mechanism to externalize reasoning. The definition of an action is extended to utilize guard functions—not merely as gates, but as active postcondition validators that project the LLM’s internal generation process onto a verifiable external state.
These sensing actions enable a dual-state architecture that provides:
- Observable Workflow Reasoning: Unlike opaque internal monologues (e.g., Chain-of-Thought), reasoning is captured in explicit state transitions, converting probabilistic generation into deterministic logical steps.
- Bounded Indeterminacy: The system guarantees termination and cost control through deterministic validation predicates and finite retry budgets.
- Atomic Composition: Generation and verification are treated as a single transactional unit (an "Atomic Action Pair"), ensuring that invalid content never pollutes the workflow state.
2 Definitions
To rigorously formalize the relationship between deterministic workflows and stochastic generators, foundational concepts in agency are distinguished from the specific architectural interpretations employed in this framework.
2.1 Foundational Definitions
- Promise Theory (Burgess, 2015): A model of voluntary cooperation where autonomous agents issue promises regarding intended behavior rather than guarantees. Interactions are defined by the consumer’s responsibility to verify promise fulfillment, replacing command-and-control assumptions.
- Agent (Russell & Norvig, 1995): A function mapping a complete percept history to actions. A rational agent selects actions that maximize an expected performance measure given its percept sequence and built-in knowledge.
- Control Boundary (Sutton & Barto, 1998): The boundary defining the agent comprises only those components that can be arbitrarily modified by the control policy. Components outside this boundary constitute the environment.
- Bounded Rationality (Simon, 1955): Rational agents operating under computational constraints do not optimize for the global maximum; instead, they satisfice, selecting the first solution that meets the aspiration level (validity criteria) within the available search budget.
- Weak Agency (Wooldridge & Jennings, 1995): A software system exhibiting autonomy, reactivity, and pro-activeness, without implying consciousness or mental states.
2.2 Architectural Definitions
- Control Boundary (Generative Application): This framework applies Sutton & Barto’s definition to the resource-dependent nature of Large Language Models (LLMs). The agent boundary is defined by modifiability relative to the time horizon:
  - Intra-Episode: The agent controls context composition ($C$) and workflow state transitions.
  - Inter-Episode: With sufficient compute, the agent may control adapter parameters (e.g., LoRA) or distilled weights.
  - Base Model: The pre-trained weights remain in the environment, providing a stochastic generation oracle that functions as the fixed generative component.
- Goal-Based Agent (Deterministic Controller): A rational decision function that treats stochastic generations as percepts rather than actions. The agent observes the opaque output of the LLM (environment) and executes deterministic state transitions (actions) based on verification results. This ensures that while the content is stochastic, the control flow remains strictly deterministic.
- Neuro-Symbolic Agentic System: A software architecture integrating neural generation with symbolic verification. The LLM (as a component of the environment) issues generation promises; deterministic guard functions verify promise fulfillment within an atomic transaction, ensuring that invalid states are never committed to the persistent workflow history.
- Dual-State Architecture: An implementation pattern that separates the system state space into two distinct spaces:
  - $\mathcal{S}_{\text{workflow}}$ (Control State): A deterministic, finite state machine tracking goal progress and guard satisfaction.
  - $\mathcal{S}_{\text{env}}$ (Information State): An append-only versioned repository of generation history, artifacts, and guard feedback, enabling in-context learning without polluting the control flow.
3 Formal Framework
3.1 Dual State Space
Definition 1 (State Space Decomposition).
The system state space $\mathcal{S}$ is decomposed into an observable workflow space and an opaque environment space:
$$\mathcal{S} = \mathcal{S}_{\text{workflow}} \times \mathcal{S}_{\text{env}} \tag{1}$$
- Workflow State ($\mathcal{S}_{\text{workflow}}$): Defined as the set of all truth assignments to the guard functions:
  $$\mathcal{S}_{\text{workflow}} = \{\, s_w \mid s_w : \mathcal{G}_{\text{id}} \to \{\top, \bot\} \,\} \tag{2}$$
  where $\mathcal{G}_{\text{id}}$ is the set of unique guard identifiers.
- Environment State ($\mathcal{S}_{\text{env}}$): Defined as the Cartesian product of the artifact space and context space:
  $$\mathcal{S}_{\text{env}} = \mathcal{A} \times \mathcal{C} \tag{3}$$
  A specific environment state is denoted as a tuple $\langle a, C \rangle$, where $a \in \mathcal{A}$ is the current artifact (mutable result) and $C \in \mathcal{C}$ is the cumulative context (immutable history).
Remark 1 (Information Abstraction).
The workflow state acts as a finite abstraction of execution progress. While the guard function returns detailed feedback (e.g., compiler logs), this information is projected into the opaque Context ($C$). Only the binary satisfaction signal is retained in $\mathcal{S}_{\text{workflow}}$, preserving the finiteness of the planning space.
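As a rough illustration of the decomposition above, the following sketch models a workflow state as a truth assignment over guard identifiers and an environment state as an artifact/context pair. The names `EnvState` and `all_workflow_states`, and the guard identifiers `syntax` and `tests`, are assumptions made for this example only.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EnvState:
    """Hypothetical environment state: current artifact plus cumulative context."""
    artifact: str
    context: Tuple[str, ...]


def all_workflow_states(guard_ids: List[str]) -> List[Dict[str, bool]]:
    """Enumerate every truth assignment over the given guard identifiers."""
    return [
        dict(zip(guard_ids, bits))
        for bits in product([False, True], repeat=len(guard_ids))
    ]


# Two guards give 2**2 = 4 possible workflow states, however large the
# space of artifacts and contexts may be.
states = all_workflow_states(["syntax", "tests"])
```

The point of the sketch is only that the workflow space stays finite and small, while `EnvState` can hold arbitrary text.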
3.2 Artifacts, Context, and Provenance
To ensure auditability and enable effective backtracking, the environment is formalized not as a mutable store, but as an append-only versioned repository.
Definition 2 (Artifact Space & Versioning).
Let $\mathcal{A}$ be the set of all possible concrete outputs. A Versioned Repository $\mathcal{R}$ is defined as a Directed Acyclic Graph (DAG) where nodes represent artifact versions and edges represent derivation steps.
Every generative action creates a new node in $\mathcal{R}$ rather than overwriting the previous state. This strictly preserves the failure history (“rejected branches”) for future learning.
Definition 3 (Hierarchical Context Composition).
The context $C$ conditioning the generator and available to the guard is the composition of three distinct scopes:
$$C = \langle \mathcal{E}, C_{\text{local}}, H_{\text{feedback}} \rangle \tag{4}$$
- Ambient Environment ($\mathcal{E}$): Contains the Versioned Repository $\mathcal{R}$ (providing read-only access to all finalized ancestor and cousin artifacts) and Global Constraints $\Omega$.
- Local Context ($C_{\text{local}}$): The active scope for the current planning node, containing the Static Specification $\Psi$ (requirements/tests for this specific step) and the Current Artifact $a$.
- Feedback History ($H_{\text{feedback}}$): The accumulated sequence of guard rejections for this specific node: $H_{\text{feedback}} = [\phi_1, \ldots, \phi_k]$.
Remark 2 (Context Isolation).
By explicitly separating $C_{\text{local}}$ from $\mathcal{E}$, it is ensured that hallucinations or failures in a sub-task do not pollute the global context. When the workflow backtracks, $H_{\text{feedback}}$ is cleared and $a$ is reverted, but the Ambient Environment $\mathcal{E}$ and Specification $\Psi$ remain invariant.
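The three scopes and the backtracking behaviour described above can be sketched with immutable dataclasses. This is a minimal illustration under assumed names (`AmbientEnvironment`, `LocalContext`, `Context`, `with_feedback`, `backtracked`); none of them are taken from the framework's code.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class AmbientEnvironment:
    repository: Tuple[str, ...] = ()    # finalized artifacts, read-only
    constraints: Tuple[str, ...] = ()   # global constraints


@dataclass(frozen=True)
class LocalContext:
    specification: str                  # static specification for this step
    artifact: Optional[str] = None      # current artifact, if any


@dataclass(frozen=True)
class Context:
    ambient: AmbientEnvironment
    local: LocalContext
    feedback: Tuple[str, ...] = ()      # guard rejections for this node

    def with_feedback(self, message: str) -> "Context":
        """Return a new context with one more guard rejection appended."""
        return replace(self, feedback=self.feedback + (message,))

    def backtracked(self, previous_artifact: Optional[str] = None) -> "Context":
        """Clear feedback and revert the artifact; ambient scope and spec are kept."""
        local = replace(self.local, artifact=previous_artifact)
        return replace(self, local=local, feedback=())
```

Because every update returns a new object, a failed sub-task cannot mutate the ambient scope shared by other nodes.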
3.3 State Evolution Logic
To preserve the finiteness of the planning space while enabling learning, a distinction is made between the Control State (Workflow) and the Information State (Context).
Definition 4 (Workflow Stability).
The workflow state $s_w$ is invariant under guard failure. That is, if the guard returns $\bot$, the control state does not transition:
$$G(a, C) = (\bot, \phi) \implies s_w' = s_w$$
Progress in $\mathcal{S}_{\text{workflow}}$ occurs exclusively upon guard satisfaction ($\top$).
Definition 5 (Context Refinement).
While the workflow state remains stable on failure, the context evolves to capture the error signal $\phi$. Let $C_k$ be the context at attempt $k$. The transition is defined as:
$$C_{k+1} = C_k \oplus \phi_k$$
where $\oplus$ appends $\phi_k$ to the Feedback History of Definition 3. This ensures that while the planner remains at the same node, the generator’s conditioning changes monotonically.
3.4 Action Pairs & Preconditions
Definition 6 (Action Pair).
An action is defined as a tuple $A = \langle \rho, a_{\text{gen}}, G \rangle$ representing the sequence of verification and execution:
- $\rho$ is the Precondition (Entry Gate). It determines if the action pair is applicable in the current workflow state.
- $a_{\text{gen}}$ is the Generator (Execution). It consumes context to produce an artifact.
- $G$ is the Guard (Exit Gate). It evaluates the artifact against the context to update the state or provide feedback.
Remark 3 (Guard Input Scoping).
While Definition 6 provides guards access to the full context $C$ (to theoretically allow validation against any historical artifact), in practice, well-designed guards should accept only the minimal required inputs. The Workflow configuration is responsible for extracting specific artifacts from $\mathcal{R}$ and passing them explicitly to the execution runtime, preserving guard simplicity and testability.
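One way to picture an action pair in code is as a small record holding the three callables. The sketch below is illustrative only: `ActionPair`, `is_int_guard`, and the lambda standing in for an LLM call are hypothetical, and the guard takes only the artifact, in the spirit of the minimal-input advice in Remark 3.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple

GuardResult = Tuple[bool, str]   # (satisfied, feedback)


@dataclass(frozen=True)
class ActionPair:
    guard_id: str
    precondition: Callable[[Mapping[str, bool]], bool]   # entry gate
    generator: Callable[[str], str]                      # context -> artifact
    guard: Callable[[str], GuardResult]                  # exit gate


def is_int_guard(artifact: str) -> GuardResult:
    """Toy guard: accept artifacts that parse as an integer."""
    try:
        int(artifact)
    except ValueError:
        return False, f"{artifact!r} is not an integer"
    return True, ""


pair = ActionPair(
    guard_id="is_int",
    precondition=lambda state: not state.get("is_int", False),
    generator=lambda context: "42",   # stand-in for a stochastic LLM call
    guard=is_int_guard,
)
```

Bundling generator and guard in one object makes it harder to call the generator without also running its verification.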
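Definition 7 (System Dynamics).
The evolution of the full system state $\langle s_w, s_{\text{env}} \rangle$ (where $s_w \in \mathcal{S}_{\text{workflow}}$ and $s_{\text{env}} = \langle a, C \rangle$) upon executing action $A = \langle \rho, a_{\text{gen}}, G \rangle$ is defined as:
1. Generation: The generator conditions on the current context to produce a new artifact $a'$:
   $$a' \sim a_{\text{gen}}(C)$$
2. Sensing: The guard evaluates the new artifact within context $C$:
   $$(v, \phi) = G(a', C)$$
3. State Update: The next state is determined by the guard result $v$.
   If $v = \top$ (Advance):
   $$s_w' = s_w[g \mapsto \top], \qquad s_{\text{env}}' = \langle a', C \rangle \tag{5}$$
   If $v = \bot$ (Refine):
   $$s_w' = s_w, \qquad s_{\text{env}}' = \langle a', C \oplus \phi \rangle \tag{6}$$
   Here $g \in \mathcal{G}_{\text{id}}$ identifies the guard $G$, and $\oplus$ appends the feedback $\phi$ to the context.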
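The three phases of Definition 7 (generation, sensing, state update) can be sketched as a single pure function. This is a simplified reading, not the reference implementation: the context is reduced to a tuple of feedback messages, and `step`, `generate`, and `guard` are hypothetical names.

```python
from typing import Callable, Dict, Tuple

Feedback = Tuple[str, ...]


def step(
    workflow: Dict[str, bool],
    feedback: Feedback,
    guard_id: str,
    generate: Callable[[Feedback], str],
    guard: Callable[[str], Tuple[bool, str]],
) -> Tuple[Dict[str, bool], str, Feedback]:
    """One generate-then-verify transition; returns (workflow, artifact, feedback)."""
    candidate = generate(feedback)            # 1. Generation
    ok, message = guard(candidate)            # 2. Sensing
    if ok:                                    # 3a. Advance: the guard flips to True
        return {**workflow, guard_id: True}, candidate, feedback
    # 3b. Refine: workflow state is unchanged, only the feedback grows
    return workflow, candidate, feedback + (message,)
```

Note that the failure branch returns the original `workflow` mapping untouched, which is the stability property of Definition 4.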
Definition 8 (Workflow Transition Function).
3.5 Planning Problem
Definition 9 (Guard-Based Planning Problem).
The planning problem is a tuple
$$\mathcal{P} = \langle s_0, C_0, \mathcal{S}_{\text{goal}} \rangle$$
where:
- $s_0 \in \mathcal{S}_{\text{workflow}}$ is the initial control state (typically all guards $\bot$).
- $C_0$ is the initial specification context.
- $\mathcal{S}_{\text{goal}} \subseteq \mathcal{S}_{\text{workflow}}$ is the set of satisfying goal states.
A solution is a policy $\pi$ that guarantees termination in $\mathcal{S}_{\text{goal}}$ within finite steps, subject to the generator capability assumption.
Remark 4 (Complexity Collapse).
Standard planning problems suffer from exponential search space explosion ($O(b^d)$). However, in industrial workflows where the task topology is fixed (i.e., sequential or DAG), the branching factor $b \approx 1$. This framework exploits this by converting the Search Problem (finding a path) into a Reliability Problem (ensuring the path is traversable). The complexity is thus dominated by the Retry Limit ($R_{\max}$) rather than the workflow depth.
3.6 Execution Semantics
The execution of policy $\pi$ follows a refined control loop that handles context augmentation upon failure, as detailed in Algorithm 1.
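A control loop of this kind might look like the sketch below. It assumes a strictly sequential workflow and interprets the retry limit as the number of additional attempts allowed after the first; both are assumptions of this example, and `run_workflow` is a hypothetical name rather than the paper's Algorithm 1.

```python
from typing import Callable, Dict, List, Tuple

Feedback = Tuple[str, ...]
Step = Tuple[str, Callable[[Feedback], str], Callable[[str], Tuple[bool, str]]]


def run_workflow(steps: List[Step], max_retries: int):
    """Run (guard_id, generate, guard) steps in order with a bounded retry budget.

    Returns (workflow_state, committed_artifacts, succeeded).
    """
    workflow: Dict[str, bool] = {guard_id: False for guard_id, _, _ in steps}
    artifacts: Dict[str, str] = {}
    for guard_id, generate, guard in steps:
        feedback: Feedback = ()                 # feedback is local to each step
        for _attempt in range(max_retries + 1):
            candidate = generate(feedback)
            ok, message = guard(candidate)
            if ok:
                workflow[guard_id] = True       # commit only verified artifacts
                artifacts[guard_id] = candidate
                break
            feedback = feedback + (message,)    # refine: context grows, state does not
        else:
            return workflow, artifacts, False   # retry budget exhausted
    return workflow, artifacts, True
```

The loop always terminates after at most `len(steps) * (max_retries + 1)` generator calls, which is the bounded-indeterminacy property the framework relies on.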
4 Theoretical Results
To establish convergence guarantees, the generator’s competence is formalized as a probability bound.
Assumption 1 (Generator $\epsilon$-Capability).
It is assumed that for any valid specification context $C$, the generator has a non-zero probability of producing an artifact that satisfies the guard. Formally, there exists an $\epsilon > 0$ such that:
$$P\big(\operatorname{status}(G(a, C)) = \top \;\big|\; a \sim a_{\text{gen}}(C)\big) \ge \epsilon$$
where $\operatorname{status}(\cdot)$ extracts the boolean status from the guard tuple.
Lemma 1 (Artifact Determinism Sufficiency).
If a generator produces artifact $a$, then guard evaluation $G(a, C)$ is deterministic regardless of whether the generator itself is deterministic.
Proof.
Since the artifact $a$ is immutable once produced, and $C$ is fixed during the evaluation step, and $G$ is a function mapping inputs to outputs, $G(a, C)$ must yield the same result for repeated evaluations. ∎
Proposition 1 (Workflow State Projection).
The set of guard functions $\mathcal{G}$ defines a deterministic projection $\Pi : \mathcal{S}_{\text{env}} \to \mathcal{S}_{\text{workflow}}$. That is, for any opaque environment state $\langle a, C \rangle$, there exists exactly one corresponding observable workflow state $s_w$.
Proof.
Let $\mathcal{G} = (G_1, \ldots, G_n)$ be the ordered set of guard functions. By Lemma 1, for a fixed artifact $a$ and context $C$, the evaluation of each guard $G_i(a, C)$ is deterministic. The evaluation vector $\mathbf{v} = (v_1, \ldots, v_n)$ is defined such that $v_i = \operatorname{status}(G_i(a, C))$. The workflow state $s_w$ is defined as the truth assignment mapping $\mathcal{G}_{\text{id}} \to \{\top, \bot\}$, which is isomorphic to the vector $\mathbf{v}$. Since each component $v_i$ is deterministic, the vector $\mathbf{v}$ (and thus the state $s_w$) is unique for any given $\langle a, C \rangle$. Thus, $\Pi$ is a well-defined function. ∎
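The projection in Proposition 1 amounts to evaluating every guard on a fixed artifact and context and keeping only the boolean part of each result. The sketch below illustrates that idea; `project` and the two toy guards are hypothetical.

```python
from typing import Callable, Dict, Tuple

Guard = Callable[[str, str], Tuple[bool, str]]   # (artifact, context) -> (status, feedback)


def project(guards: Dict[str, Guard], artifact: str, context: str) -> Dict[str, bool]:
    """Map an environment state (artifact, context) to its workflow state.

    Only the boolean status of each guard is kept; feedback text is dropped.
    """
    return {gid: guard(artifact, context)[0] for gid, guard in guards.items()}


guards: Dict[str, Guard] = {
    "non_empty": lambda a, c: (bool(a.strip()), "artifact is empty"),
    "mentions_spec": lambda a, c: (c in a, "spec keyword missing"),
}
```

As long as each guard is a pure function of its inputs, calling `project` twice on the same artifact and context yields the same mapping, regardless of how the artifact was generated.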
Proposition 2 (Finite Search Space via AND-OR Trees).
Let
$$\mathcal{S}_r = \{\, s \in \mathcal{S}_{\text{workflow}} \mid s \text{ is reachable from } s_0 \,\}$$
be the set of reachable workflow states. The planning problem maps to a finite AND-OR tree search over $\mathcal{S}_r$, where solution existence is decidable.
Proof.
Step 1 (Tree Construction): A search tree is constructed, analogous to standard AO* search, where OR-Nodes represent agent decisions and AND-Nodes represent environmental responses $\{\top, \bot\}$.
Step 2 (Finiteness): A ’fail’ outcome from the Guard action increments the local retry counter $r$ while the workflow state remains constant (Definition 4). The generator action is valid (applicable) iff $r < R_{\max}$. Thus, no branch can exceed $R_{\max}$ consecutive failures, and the total maximum depth is bounded by $|\mathcal{G}| \cdot R_{\max}$ across all guards.
Step 3 (Decidability): Since the depth is bounded by the retry limit and the set of reachable states is finite (bounded by the workflow specification), the total tree size is finite.
∎
Proposition 3 (Generator Independence).
Plan correctness depends only on (a) guard predicate specifications and (b) the generator’s capability to eventually produce valid artifacts, independent of the generator’s internal mechanism.
Proof.
By Definition 7 (System Dynamics), the workflow state transition depends exclusively on the current observable state $s_w$ and the guard result $v$. The guard function $G$ evaluates the artifact $a$ directly, treating the generation process as a black box. While the generator’s internal mechanism determines the probability distribution of $a$, Assumption 1 guarantees that this distribution has non-zero support for valid artifacts ($\epsilon > 0$). Therefore, the logic governing state advancement and goal satisfaction operates independently of the generator’s internal state space or transition probabilities. ∎
Proposition 4 (Asymptotic Soundness).
Given an $\epsilon$-capable generator ($\epsilon > 0$) and a finite contingent plan $\pi$, the probability of failure approaches 0 as $R_{\max} \to \infty$.
Proof.
Let $p$ be the probability of success for a single attempt. In the worst-case scenario where the generator is memoryless (i.e., it does not learn from the error context), the probability of node failure after $R_{\max}$ attempts is $(1 - p)^{R_{\max}}$. Since $p \ge \epsilon$, the upper bound on failure probability is:
$$P(\text{fail}) \le (1 - \epsilon)^{R_{\max}} \tag{7}$$
Since $0 \le 1 - \epsilon < 1$:
$$\lim_{R_{\max} \to \infty} (1 - \epsilon)^{R_{\max}} = 0$$
∎
Remark 5 (Conservative Bound).
The proof above assumes a "memoryless" generator (worst-case). In practice, because context accumulates feedback on every failure (Definition 5), the conditional probability of success typically increases with retries ($p_{k+1} \ge p_k$). Thus, Inequality 7 represents a conservative lower bound on reliability.
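To make the worst-case bound concrete: for a memoryless generator whose per-attempt success probability is at least ε, the chance that all R attempts fail is at most (1 − ε)^R. The helper below simply evaluates that expression; `failure_bound` is a hypothetical name and ε = 0.5 is an arbitrary illustrative value, not a measured one.

```python
def failure_bound(epsilon: float, attempts: int) -> float:
    """Upper bound (1 - epsilon) ** attempts on failing every one of `attempts` tries."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return (1.0 - epsilon) ** attempts


# With epsilon = 0.5 the bound halves on every extra attempt.
bounds = [failure_bound(0.5, r) for r in (1, 2, 4, 8)]
# bounds == [0.5, 0.25, 0.0625, 0.00390625]
```

The geometric decay is why a small, finite retry budget is often enough in practice, even before any benefit from feedback is counted.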
Corollary 1 (Reliability Bound).
To achieve a target global reliability $1 - \delta$ (where $0 < \delta < 1$), the minimum retry limit is:
$$R_{\max} \ge \frac{\ln(\delta / N)}{\ln(1 - \epsilon)} \tag{8}$$
where $N$ is the number of sequential steps in the workflow.
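A retry budget can be sized numerically under one explicit assumption: each of the N sequential steps fails with probability at most (1 − ε)^R, and a union bound over the steps gives a global failure probability of at most N(1 − ε)^R. The function `min_retry_limit` below solves N(1 − ε)^R ≤ δ for the smallest integer R; the name and the union-bound derivation are assumptions of this sketch.

```python
import math


def min_retry_limit(epsilon: float, delta: float, n_steps: int) -> int:
    """Smallest integer R such that n_steps * (1 - epsilon) ** R <= delta."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    if epsilon == 1.0:
        return 1   # a perfect generator needs a single attempt
    ratio = math.log(delta / n_steps) / math.log(1.0 - epsilon)
    return max(1, math.ceil(ratio))


# Example: 3 steps, epsilon = 0.5, target failure rate 1%.
# 3 * 0.5**9 is about 0.0059 (within budget); 3 * 0.5**8 is about 0.0117 (not).
r = min_retry_limit(0.5, 0.01, 3)
```

Because the union bound is conservative, the returned value is a sufficient budget rather than a tight one.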
Corollary 2 (Complexity Bound).
The worst-case control complexity is linear with respect to the reachable state space size:
$$O(|\mathcal{S}_r| \cdot R_{\max})$$
For sparse sequential workflows where $|\mathcal{S}_r| \approx N$, this ensures tractability.
5 Complexity Analysis
Computational complexity is analyzed relative to the Workflow State Space ($\mathcal{S}_{\text{workflow}}$) rather than the Environment State Space ($\mathcal{S}_{\text{env}}$).
5.1 State Space Abstraction
Standard generative planning operates in $\mathcal{S}_{\text{env}}$, the space of all possible artifacts (e.g., all valid Unicode strings). Since $|\mathcal{S}_{\text{env}}| \to \infty$, standard MDP planning algorithms, whose complexity scales with $|\mathcal{S}_{\text{env}}|$, are intractable.
Execution is projected into $\mathcal{S}_{\text{workflow}}$, defined by the boolean status of guards. While the theoretical size of this space is $2^{|\mathcal{G}|}$, it is observed that industrial workflows typically follow a strict sequential or Directed Acyclic Graph (DAG) structure.
$$|\mathcal{S}_r| \ll 2^{|\mathcal{G}|} \tag{9}$$
The Reachable Execution Tree $\mathcal{S}_r$ is defined as the subset of states reachable from $s_0$ under the workflow transition function. For a sequential workflow of length $N$, the reachable space is linear in $N$.
Remark on Generator Cost: While generator execution is treated as an atomic action in the planning layer, each transition in $\mathcal{S}_{\text{workflow}}$ corresponds to a potentially computationally intensive operation in $\mathcal{S}_{\text{env}}$ (e.g., LLM inference). However, this cost is constant per node visit and does not affect the asymptotic complexity of the search algorithm itself.
5.2 Planning Complexity
Given $N$ guards and a retry limit $R_{\max}$, the planning problem reduces to finding an optimal policy in the finite AND-OR tree $\mathcal{T}$. The size of this tree is bounded by:
$$|\mathcal{T}| \le N \cdot R_{\max}$$
Finding the optimal policy involves a single traversal of this finite tree (e.g., via backward induction), with a computational cost of $O(N \cdot R_{\max})$. This represents a reduction from infinite/intractable ($|\mathcal{S}_{\text{env}}|$) to linear/polynomial complexity ($O(N \cdot R_{\max})$) with respect to the workflow length, conditioned on the assumption that the workflow structure is sparse.
6 Experimental Validation
6.1 Illustrative Validations
The primary contribution of this work is the formal Dual-State Framework and the concept of Atomic Action Pairs. To validate this architecture, a set of Diagnostic Probes was selected rather than broad industry benchmarks such as HumanEval or SWE-Bench.
This decision was driven by two factors. First, as an independent researcher with limited computational resources (time and GPU availability), the goal was to isolate the architectural mechanism efficiently rather than benchmark model parameters at scale. Second, and more critically, current benchmarks are not designed to evaluate agentic control flow in the context of formal verification.
For instance, while SWE-Bench provides realistic software engineering tasks, applying the framework would require a pre-existing library of Guard Actions (formal specifications or executable tests) mapped to each repository issue. Constructing such a “Guard Library” is a non-trivial engineering challenge and represents a distinct avenue for future research (potentially involving a meta-policy that selects guards dynamically).
Consequently, this experimental validation is illustrative. It is designed to answer a single mechanistic question: Does the introduction of deterministic guards enable a stochastic model to solve problems it could not solve zero-shot?
To this end, three tasks were selected (see Appendix C). It is hypothesized that while the models possess high prior knowledge (“High Priors”) for the concepts behind all three, they differ significantly in implementation difficulty:
- LRU Cache: The model is expected to have High Priors for both the concept and the implementation. The pattern (Hash Map + Linked List) is standard in training data. The challenge is simply maintaining state without drift.
- Template Engine: The model is expected to have High Priors in Concept (knowing what a template engine is), but Low Priors in Implementation. Since there is no single standard way to write a parser from scratch, the model must synthesize a novel solution rather than recalling a memorized one.
- Password Validator: The model is expected to have High Priors in Concept but faces a Calculation Gap. While the rules are simple to state, satisfying them requires mathematical operations (e.g., Prime Number calculation) that are difficult for token-prediction models to implement correctly.
In choosing these tasks, specific Atomic Action Pairs are isolated to validate that deterministic guards can enforce reliability on a single stochastic generation node. Data was collected to test two primary hypotheses:
- H1 (Reliability): The dual-state, guard-based planning framework significantly increases task success rates compared to a standard, single-attempt baseline.
- H2 (Efficiency): The increase in reliability is achieved with modest additional generation attempts, and this efficiency varies predictably with model scale.
| Task | Prior Knowledge vs. Implementation Gap | Guard Role | Expected Behavior |
|---|---|---|---|
| LRU Cache | High Prior / Standard Implementation. The model likely memorized this pattern during training. | Drift Prevention. The guard ensures the model doesn’t make careless errors (hallucinations) in a known pattern. | High Baseline Success. It is expected for competent models to solve this easily; the guard mainly fixes minor slip-ups. |
| Template Engine | High Concept / Novel Implementation. The model knows the concept but must invent the specific logic (parsing) on the fly. | Syntax Guide. The guard provides error messages that help the model fix its specific implementation choices. | Optimization. The model is expected to start with broken code and iterate toward a working solution using the feedback. |
| Password Validator | High Concept / Calculation Gap. The model knows the rules but is unable to calculate the math (Primes) required to satisfy them. | Hard Gating. The guard rejects invalid math. | Low success. This is expected to be the hardest task because knowing the definition of a prime number doesn’t help the model calculate one. |
Limitations of this Approach: It is acknowledged that these tasks are bounded and synthetic. However, restricting the problem space eliminates confounding variables such as library knowledge or prompt ambiguity, allowing success to be attributed directly to the architectural intervention (the Guard mechanism).
6.2 Model Qualification
A foundational premise of this work is that the Dual-State Framework is an architectural multiplier for capability, not a substitute for it. The framework requires a generator capable of instruction following—specifically, the ability to ingest a guard’s error trace and attempt a semantic correction.
To isolate this architectural effect, a model is deemed qualified if it consistently produces parsable output in a zero-shot prompt. Models that fail this basic threshold (e.g., raw FIM models, or those producing blank tokens) violate Assumption 1 ($\epsilon > 0$). Including them would conflate architectural failure with model incapacity.
Specifically, the qualification threshold requires the model to adhere to the requested output format (Markdown code blocks) under a generic system prompt. Models that produce valid code but fail to wrap it in standard formatting (e.g., Markdown backticks) are classified as unqualified, as they fail the fundamental agentic requirement of Interface Compliance.
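An interface-compliance check of this kind could be sketched as a small parser that looks for a fenced Markdown code block in the model's response. The regular expression, the accepted language tags, and the names `extract_code` and `is_interface_compliant` are assumptions of this example; the harness's actual parsing rules are not specified here.

```python
import re
from typing import Optional

# The fence marker is built programmatically so this example does not
# contain a literal Markdown fence of its own.
FENCE = "`" * 3
CODE_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL,
)


def extract_code(response: str) -> Optional[str]:
    """Return the body of the first fenced code block, or None if there is none."""
    match = CODE_BLOCK.search(response)
    return match.group(1) if match else None


def is_interface_compliant(response: str) -> bool:
    """A response qualifies only if its code is wrapped in a fenced block."""
    return extract_code(response) is not None
```

Under this check, a response containing correct but unfenced code is rejected, matching the qualification rule described above.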
6.3 Experimental Setup
- Workflow: The workflow for all tasks enforces structural correctness via a sequential Guard Validation Chain:
  1. Generation: The probabilistic output from the LLM.
  2. Syntax Validation: Enforced by SyntaxGuard (validating Python AST parsing).
  3. Functional Correctness: Enforced by TestGuard (validating functional correctness via unit tests).

  Each guard failure triggers the refinement loop, preventing transition to the next logical state until the constraint is met.
- Models & Runtime: 13 models from 6 families were evaluated (see Table 2). All models were executed locally using Ollama v0.12.3 to ensure a standardized, offline inference environment. The focus is deliberately on Small Language Models (SLMs) (≤15B parameters) to test the hypothesis that architectural constraints can substitute for parameter scale. Demonstrating high reliability on these lightweight models validates the framework’s ability to act as a capability multiplier, enabling secure, local execution for SDLC tasks without reliance on massive, proprietary APIs.
- Hardware Specification: Experiments were conducted on a workstation equipped with an AMD Threadripper PRO (32 cores), 125GB RAM, and an NVIDIA RTX A4000 GPU. This setup accommodated the varying VRAM requirements of models ranging from 1.3B to 15B parameters without quantization loss beyond the standard 4-bit (q4_k_m) schema.
- Inference Parameters: To verify the architecture’s ability to manage high variance, all models were sampled at a fixed temperature. This relatively high entropy setting ensures the generator is sufficiently "irrational," testing the framework’s ability to constrain a stochastic process.
- Software Harness: The control loop was implemented in Python 3.12.11. To strictly isolate the "Control" from the "Generation," the harness enforces:
  - Context Isolation: A hard reset of the inference context window between trials.
  - Template Normalization: A unified system prompt is applied across all models, intentionally avoiding model-specific chat templates. This acts as a stress test for "Instruction Following Robustness": models that rely on bespoke control tokens rather than natural language instructions fail the qualification step.
- Guards ($G$): A single comprehensive guard was implemented, which encapsulates:
  1. Execution Safety: Controlled process execution with a 60s timeout to prevent infinite loops.
  2. Functional Correctness: A suite of task-specific unit tests against the generated artifact.
  3. Diagnostic Feedback: If failure occurs, the guard returns specific error traces to guide retries.
- Configurations: Two execution modes are compared:
  - Baseline (One-Shot): A single generation attempt, measuring the model’s raw zero-shot capability.
  - Guarded (Agentic): The contingent planner with a fixed retry limit ($R_{\max}$). This mode utilizes the guard’s diagnostic feedback to iteratively refine the artifact upon failure.
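The guard chain described in the Workflow item (syntax validation followed by functional tests) might be sketched as follows. This is a simplified stand-in, not the harness's guard library: `syntax_guard`, `unit_test_guard`, `guard_chain`, and `add_check` are hypothetical, and the functional check runs in-process with `exec` instead of the sandboxed subprocess with a 60s timeout that the setup describes.

```python
import ast
from typing import Callable, Dict, List, Tuple

GuardResult = Tuple[bool, str]   # (satisfied, diagnostic feedback)


def syntax_guard(source: str) -> GuardResult:
    """Accept `source` only if it parses as Python."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return False, f"SyntaxError at line {exc.lineno}: {exc.msg}"
    return True, ""


def unit_test_guard(source: str, checks: List[Callable[[Dict], None]]) -> GuardResult:
    """Execute `source`, then run each check against the resulting namespace.

    NOTE: in-process exec() is used for brevity only; it provides no
    isolation and no timeout.
    """
    namespace: Dict = {}
    try:
        exec(source, namespace)
        for check in checks:
            check(namespace)
    except Exception as exc:   # any failure becomes feedback for the next attempt
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


def guard_chain(source: str, checks: List[Callable[[Dict], None]]) -> GuardResult:
    """Run the syntax guard first; only syntactically valid code reaches the tests."""
    ok, feedback = syntax_guard(source)
    if not ok:
        return ok, feedback
    return unit_test_guard(source, checks)


def add_check(namespace: Dict) -> None:
    """Example task-specific check for a hypothetical `add` function."""
    assert namespace["add"](2, 3) == 5
```

Ordering the cheap syntax check before the test run means malformed output is rejected with a precise parser message before any code is executed.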
| Category | Models |
|---|---|
| Large (9B+) | Qwen2.5-Coder (14B), StarCoder2 (15B), Phi4 (14B) |
| Medium (4-8B) | Yi-Coder (9B), Granite-Code (8B), Qwen2.5-Coder (7B), CodeGemma (7B), DeepSeek-Coder (6.7B) |
| Small (2-4B) | Qwen2.5-Coder (3B), Granite-Code (3B), Phi4-Mini (3.8B) |
| Tiny (≤2B) | Qwen2.5-Coder (1.5B), CodeGemma (2B), Yi-Coder (1.5B), DeepSeek-Coder (1.3B) |
6.4 Results
50 independent trials were executed for each model across the three diagnostic probes. To quantify the architectural benefit of the Dual-State Framework, three quantities are reported: the Baseline Success (One-Shot), the Guarded Success, and the Reliability Gain ($\Delta$), which represents the absolute percentage point improvement attributable to the guard mechanism.
| Model | Base | Guarded | Gain ($\Delta$, pp) | Avg Retries |
|---|---|---|---|---|
| Yi-Coder (9B) | 56% | 98% | +42*** | 0.92 |
| StarCoder2 (15B) | 60% | 100% | +40*** | 0.32 |
| Qwen2.5-Coder (3B) | 8% | 42% | +34*** | 2.72 |
| Qwen2.5-Coder (7B) | 70% | 98% | +28*** | 0.38 |
| Granite-Code (8B) | 50% | 76% | +26* | 1.46 |
| Phi4 (14B) | 8% | 26% | +18* | 2.26 |
| Qwen2.5-Coder (14B) | 86% | 100% | +14* | 0.20 |
| DeepSeek-Coder (6.7B) | 4% | 18% | +14 | 3.10 |
| Unqualified ()* | 0–2% | 0–6% | | |
*Unqualified models: DeepSeek-Coder (1.3B), Phi4-Mini (3.8B), Yi-Coder (1.5B), Qwen2.5-Coder (1.5B)
Results for the LRU Cache task:
| Model | Base () | Guarded () | Gain () | Avg Retries |
|---|---|---|---|---|
| DeepSeek-Coder (6.7B) | 48% | 98% | +50*** | 0.76 |
| Granite-Code (8B) | 60% | 98% | +38*** | 0.52 |
| Yi-Coder (1.5B) | 62% | 98% | +36*** | 0.76 |
| Qwen2.5-Coder (1.5B) | 74% | 98% | +24*** | 0.22 |
| Granite-Code (3B) | 80% | 98% | +18** | 0.38 |
| StarCoder2 (15B) | 86% | 100% | +14* | 0.10 |
| Qwen2.5-Coder (7B) | 92% | 100% | +8 | 0.06 |
| Qwen2.5-Coder (3B) | 96% | 100% | +4 | 0.18 |
| Qwen2.5-Coder (14B) | 98% | 100% | +2 | 0.00 |
| Phi4 (14B) | 100% | 100% | – | 0.00 |
| Yi-Coder (9B) | 100% | 100% | – | 0.04 |
| Phi4-Mini (3.8B) | 60% | 58% | -2 | 1.80 |
| DeepSeek-Coder (1.3B) | 0% | 0% | – | 3.94 |
Results for the Password Validator task:
| Model | Base () | Guarded () | Gain () | Avg Retries |
|---|---|---|---|---|
| StarCoder2 (15B) | 0% | 66% | +66*** | 0.84 |
| DeepSeek-Coder (6.7B) | 50% | 96% | +46*** | 0.72 |
| Granite-Code (3B) | 36% | 80% | +44*** | 1.46 |
| Qwen2.5-Coder (1.5B) | 14% | 52% | +38*** | 1.88 |
| Granite-Code (8B) | 58% | 94% | +36*** | 0.92 |
| Yi-Coder (9B) | 76% | 100% | +24*** | 0.38 |
| Yi-Coder (1.5B) | 24% | 36% | +12 | 2.54 |
| Qwen2.5-Coder (7B) | 90% | 100% | +10 | 0.32 |
| Qwen2.5-Coder (3B) | 92% | 100% | +8 | 0.28 |
| Phi4 (14B) | 98% | 100% | +2 | 0.06 |
| Phi4-Mini (3.8B) | 0% | 0% | – | 2.18 |
| DeepSeek-Coder (1.3B) | 0% | 0% | – | 3.92 |
Note: Qwen2.5-Coder (14B) was excluded from this task due to data corruption during logging.
6.5 Analysis
The expanded benchmark across 13 models (ranging from 1.3B to 15B parameters) reveals a nuanced capability landscape. Statistical significance was assessed using Fisher’s exact test, with effect sizes reported as Cohen’s h for proportions.
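For readers who wish to recompute these statistics from the tabulated success rates, a minimal sketch follows. It assumes 50 trials per configuration and a two-sided test; the function names are hypothetical, and the implementation uses only the standard library rather than whatever statistics package produced the reported values.

```python
from math import asin, comb, sqrt


def cohens_h(p_guarded: float, p_baseline: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * asin(sqrt(p_guarded)) - 2 * asin(sqrt(p_baseline))


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are configurations (guarded, baseline); columns are (success, failure).
    The p-value sums the probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)  # tolerance for float ties
    )
    return min(1.0, p_value)


# Example: 49/50 guarded successes (98%) versus 28/50 baseline successes (56%).
h = cohens_h(49 / 50, 28 / 50)
p = fisher_exact_two_sided(49, 1, 28, 22)
print(f"Cohen's h = {h:.2f}, Fisher p = {p:.2e}")
```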
Template Engine (Structural Gap): This task exhibited the widest variance in guard effectiveness. Top performers achieved substantial gains: Yi-Coder (9B) improved from 56% to 98% (, , Cohen’s ), while StarCoder2 (15B) reached perfect reliability from a 60% baseline. The template task proved most discriminating for smaller models: DeepSeek-Coder (6.7B) achieved only 18% guarded success (, ), suggesting that instruction-following fidelity limits how effectively feedback can be utilized. The unqualified models (all under 4B parameters) showed negligible improvement (), establishing a clear capability threshold.
LRU Cache (Drift Prevention): The LRU task confirmed the framework’s efficiency for well-understood patterns. Eleven of thirteen models achieved at least 98% guarded success. Notable findings include:
• DeepSeek-Coder (6.7B) showed the largest gain (, , Cohen’s ), demonstrating that guards effectively close the reliability gap for mid-capability models.
• Phi4-Mini (3.8B) exhibited anomalous behavior: a negative gain (-2pp) with high retry costs (1.80 avg), suggesting possible overfitting to feedback or instruction-following degradation under error correction.
• DeepSeek-Coder (1.3B) achieved 0% across both configurations, establishing the canonical “unqualified” () model baseline.
Password Validator (Reasoning Gap): This task exposed a reasoning capability threshold that correlates only weakly with parameter count. Phi4 (14B) achieved a 98% baseline, while StarCoder2 (15B) achieved 0%. The guards proved transformative for StarCoder2: from a 0% baseline to 66% guarded (, , Cohen’s ), the largest effect size observed. This suggests guards can bootstrap reasoning in models that understand the structure but fail on computation. Smaller models showed the capability threshold clearly: Qwen2.5-Coder (1.5B) reached 52% guarded from a 14% baseline, while DeepSeek-Coder (1.3B) and Phi4-Mini (3.8B) remained at 0%.
Cost-Benefit Analysis: Across all valid trials, the framework demonstrates an efficiency advantage over standard “Best-of-N” sampling. A comparable Pass@5 strategy incurs a fixed 5.0× compute cost. In contrast, the sequential refinement strategy achieves reliable convergence with an average cost of just 1.2–1.6× for qualified models. The cost-benefit ratio (gain in percentage points per compute multiplier) was highest for mid-sized models: StarCoder2 (15B) on password achieved +35.9pp/x, while Qwen2.5-Coder (7B) on template achieved +20.3pp/x.
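The two ratios quoted above can be reproduced from the results tables under one assumption, which is inferred from the numbers rather than stated in the text: the compute multiplier equals one baseline attempt plus the average number of retries. The function names below are hypothetical.

```python
def compute_multiplier(avg_retries: float) -> float:
    """Assumed cost model: one initial attempt plus the average retries."""
    return 1.0 + avg_retries


def gain_per_compute(baseline_pct: float, guarded_pct: float, avg_retries: float) -> float:
    """Reliability gain (percentage points) per unit of compute multiplier."""
    return (guarded_pct - baseline_pct) / compute_multiplier(avg_retries)


# StarCoder2 (15B), Password Validator: 0% -> 66% with 0.84 average retries.
print(round(gain_per_compute(0, 66, 0.84), 1))   # 35.9

# Qwen2.5-Coder (7B), Template Engine: 70% -> 98% with 0.38 average retries.
print(round(gain_per_compute(70, 98, 0.38), 1))  # 20.3
```

Under the same cost model, a fixed Pass@5 strategy always has a multiplier of 5.0, regardless of how early a valid sample appears.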
Key Insight—Task-Specific Qualification: A critical finding is that model qualification () is task-specific, not global (see Figure 2). Phi4-Mini (3.8B) is qualified for LRU (60% baseline) but unqualified for password (0%). This has practical implications: guard-based systems should assess model capability per-task rather than assuming uniform competence.
6.6 TDD Workflow Benchmark
The preceding experiments validated single Atomic Action Pairs. To test multi-step workflows, a TDD pipeline was constructed where the output of one action pair becomes input to the next:
1. g_test: Generate pytest test functions from specification (validated by SyntaxGuard)
2. g_impl: Generate implementation that passes the generated tests (validated by DynamicTestGuard)
This creates a practical complication: the implementation must satisfy LLM-generated tests, not human-written ones. Specification errors in the first step propagate to the second.
Six tasks were selected across difficulty tiers: Stack and Queue (basic data structures), Calculator and LRUCache (state management), SimpleTemplate (string parsing), and PasswordValidator (exact error message matching). Three Qwen2.5-Coder variants (3B, 7B, 14B) were tested across 50 trials each with .
| Model | Stack | Queue | Calc. | LRU | Templ. | Pass. | Overall |
|---|---|---|---|---|---|---|---|
| Qwen2.5-Coder (14B) | 88% | 98% | 84% | 66% | 68% | 14% | 70% |
| Qwen2.5-Coder (7B) | 94% | 98% | 78% | 50% | 20% | 0% | 57% |
| Qwen2.5-Coder (3B) | 68% | 74% | 68% | 4% | 2% | 0% | 36% |
| Model | Stack | Queue | Calc. | LRU | Templ. | Pass. |
|---|---|---|---|---|---|---|
| Average Attempts | | | | | | |
| Qwen2.5-Coder (14B) | 2.4 | 2.1 | 2.5 | 3.1 | 3.6 | 4.6 |
| Qwen2.5-Coder (7B) | 2.2 | 2.1 | 2.7 | 3.7 | 4.6 | 5.0 |
| Qwen2.5-Coder (3B) | 3.2 | 2.9 | 3.0 | 4.9 | 5.0 | 5.0 |
| Average Duration (seconds) | | | | | | |
| Qwen2.5-Coder (14B) | 19.5 | 15.7 | 16.8 | 43.3 | 33.8 | 42.3 |
| Qwen2.5-Coder (7B) | 6.9 | 6.7 | 8.4 | 26.3 | 25.1 | 29.1 |
| Qwen2.5-Coder (3B) | 9.2 | 8.8 | 7.4 | 34.5 | 19.3 | 26.2 |
The results confirm expected patterns: model scale correlates with success (70% overall for 14B vs. 36% for 3B), and task difficulty creates clear tiers. Easy tasks achieve 71–98% success, while the hardest task, PasswordValidator, drops to 0–14%.
The PasswordValidator results warrant attention. Even the 14B model achieves only 14% success—far below its single-task performance. Examining the failure artifacts reveals why: LLM-generated tests frequently contain incorrect edge case expectations (e.g., wrong error message ordering for multi-violation inputs). The implementation then fails not because it is wrong, but because it must satisfy a flawed specification.
This illustrates the risk identified in Remark 7: when LLM-generated artifacts become validators, specification errors compound through the workflow. The practical mitigation is straightforward—insert a HumanGuard checkpoint between test generation and implementation to catch specification errors before they propagate.
The framework supports this; the benchmark simply omitted it to measure the failure mode. Far from being an experimental flaw, this failure mode validates the central thesis of the framework: a stochastic generator cannot serve as its own ground-truth oracle. Without an external source of truth (a human, a formal spec, or a deterministic compiler), the agent creates a closed feedback loop of hallucination.
7 Limitations
While the Dual-State Framework provides rigorous guarantees for generative workflows, it is not a panacea. Five key limitations are identified that define the boundaries of its applicability:
• Guard Design Overhead & Correctness: The framework shifts the burden of correctness from the stochastic prompt to the deterministic guard. This introduces a "Guard Design" bottleneck: the agent is only as reliable as the guard function itself. Furthermore, not all domains are easily formalizable; while syntax and functional correctness are verifiable, subjective qualities (e.g., "UI aesthetics" or "UX (User eXperience)") remain difficult to capture in deterministic predicates.
• Generator Capability Threshold (): Theoretical convergence relies on the assumption that the generator has a non-zero probability () of producing a valid artifact. As observed in the experiments with models under 3B parameters, this assumption does not hold for unqualified models. The framework cannot "fix" a model that fundamentally lacks the reasoning capacity to understand the task or the guard’s feedback.
• Latency & Computational Cost: By definition, the refinement loop introduces latency. A 2.1× computational overhead, while acceptable for asynchronous software development tasks, may be prohibitive for real-time applications requiring millisecond responsiveness.
• Context Window Saturation: The Context Refinement mechanism () relies on appending error traces to the history. For extremely complex failures or high retry limits, this can saturate the context window of the LLM, potentially degrading performance or incurring significant token costs.
• Specification Brittleness: The framework assumes a static specification . In highly exploratory domains where the requirements themselves are fluid or discovered during execution, the rigid pre-definition of guards may constrain the agent’s ability to find novel, out-of-distribution solutions.
8 Future Research
8.1 Autonomous Calibration of Latent Specifications
While Appendix F outlines a practical workflow for bootstrapping legacy systems [12], this process represents a distinct class of theoretical control problems: Specification Extraction via Oracle Inversion. Unlike standard generation where the specification is static and explicit (), legacy environments possess a latent specification encoded purely in binary execution behavior.
Future research should investigate the convergence properties of agents operating in this “Oracle Inversion” regime. Specifically, can the Dual-State Architecture guarantee that an agent’s set of generated characterization guards () asymptotically approaches the true semantic boundaries of the legacy artifact? By modeling the “Bootstrapping Phase” as a System Identification task, we can theoretically bound the number of “sensing actions” (guard executions) required to achieve a target confidence level in the generated regression suite, effectively transforming “Legacy Refactoring” from an art into a measurable, convergent algorithmic process.
8.2 Continuous Learning via The Optimization Loop
The standard execution model treats retries as computational waste. This section proposes converting that overhead into a training signal by closing the loop between three distinct entities: the Guard (the critic), the Coach (the guide), and the Generator (the actor). This creates a four-tier optimization hierarchy:
8.2.1 Tier 1: Immediate Correction (The Coach)
While Guards must remain deterministic to preserve safety guarantees, the feedback mechanism benefits from the semantic reasoning of large language models. The Action Pair is formally extended into an Extended Action Tuple:
Here, the Coach acts as a "Probabilistic Heuristic" or an internal "LLM-as-a-Judge." When the Guard fails, the Coach analyzes the binary failure signal and the artifact to produce a semantic refinement :
This decouples Safety (enforced by the deterministic Guard) from Liveness (promoted by the probabilistic Coach), allowing the agent to recover from failures using semantic feedback.
8.2.2 Tier 2: Sparse Reward Signal (The Critic)
Since guards provide ground-truth validity signals, they function as a trustworthy, albeit sparse, reward function for Reinforcement Learning (RL).
Definition 10 (Sparse Safety Reward).
A reward function is defined where:
Remark 6 (The Maze Isomorphism).
This formulation draws a direct parallel to classical Q-Learning in grid-world environments. Just as a maze-solving agent learns to avoid walls through negative rewards () while seeking the goal state [5], the Neuro-Symbolic system treats Logic Guards as “semantic walls.” The optimization loop thus effectively maps the high-dimensional, opaque manifold of the LLM onto a navigable, reward-driven maze, allowing standard RL techniques to optimize the agent’s trajectory away from invalid regions.
8.2.3 Tier 3: Dense Reward Signal (The Shaping)
While the Guard provides ground truth, the signal is sparse (binary). The Coach supplements this with a Dense Reward based on its semantic evaluation of the "distance" to the solution.
This acts as a Reward Shaping mechanism. Even if an artifact fails the Guard (Sparse Reward = -1), the Coach may assign a high Dense Reward if the logic was "almost correct" (e.g., correct algorithm but wrong syntax). This allows the Generator to improve incrementally even within invalid regions of the search space.
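The interaction between the sparse and dense signals can be illustrated with a small sketch. Several details are assumptions made for illustration: the reward of +1 on a guard pass (only the -1 on failure is stated above), a Coach score normalized to [0, 1], the additive combination, and the `weight` parameter. The function names are hypothetical.

```python
def sparse_reward(guard_passed: bool) -> float:
    """Binary ground-truth signal from the deterministic Guard.

    Assumption: +1.0 on pass. The -1.0 on failure follows the text.
    """
    return 1.0 if guard_passed else -1.0


def shaped_reward(guard_passed: bool, coach_score: float, weight: float = 0.5) -> float:
    """Sparse Guard reward plus a weighted dense term from the Coach.

    coach_score is assumed to lie in [0, 1], with 1 meaning "almost correct".
    With weight < 2, any failed artifact still scores below any passing one,
    so the Coach can rank failures but cannot override the Guard.
    """
    if not 0.0 <= coach_score <= 1.0:
        raise ValueError("coach_score must be in [0, 1]")
    return sparse_reward(guard_passed) + weight * coach_score


# Two failed artifacts: one nearly correct, one far from the solution.
print(shaped_reward(False, 0.9))  # -0.55
print(shaped_reward(False, 0.1))  # -0.95
```

Bounding the dense term in this way is one possible design choice for keeping the probabilistic Coach subordinate to the deterministic Guard.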
8.2.4 Tier 4: Policy Distillation (The Update)
To minimize the expected runtime cost (), successful traces are utilized to fine-tune the generator . A refinement episode yields a trace .
The eventual success is treated as the target label, but the update is also conditioned on the Coach’s feedback . This encourages the model not just to memorize the answer, but to internalize the reasoning process (the feedback) that led to it:
This process effectively "compiles" the runtime reasoning loop—including the Coach’s guidance—into the model’s weights.
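One way to turn a refinement episode into a supervised fine-tuning example is sketched below. The trace layout (`Attempt`), the prompt format, and the `distill` function are hypothetical, chosen only to illustrate the idea of using the final validated artifact as the label while conditioning on the Coach's feedback.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Attempt:
    artifact: str
    passed: bool              # Guard verdict for this attempt
    coach_feedback: str = ""  # semantic guidance produced after a failure


def distill(spec: str, trace: List[Attempt]) -> Optional[Dict[str, str]]:
    """Build one (prompt, completion) pair from a successful episode.

    Episodes that never pass the Guard yield no example, so only
    guard-validated artifacts are ever used as target labels.
    """
    if not trace or not trace[-1].passed:
        return None
    feedback = [a.coach_feedback for a in trace[:-1] if a.coach_feedback]
    prompt = spec
    if feedback:
        notes = "\n".join(f"- {item}" for item in feedback)
        prompt += "\n\nFeedback from earlier attempts:\n" + notes
    return {"prompt": prompt, "completion": trace[-1].artifact}


episode = [
    Attempt("def add(a, b): return a - b", False, "Use addition, not subtraction."),
    Attempt("def add(a, b): return a + b", True),
]
example = distill("Write add(a, b) returning the sum.", episode)
```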
8.3 Dynamic Guarding: Meta-Policy Optimization
While this work formalizes the Guard as a fixed component of an Atomic Action Pair, future iterations can treat the Guard function as a distinct member of the agent’s available action space . In this view, the agent is not merely a generator of code, but a rational decision-maker that must select the optimal verification strategy for a given state.
From an SDLC perspective, guards can be modeled as a Library of Actions available within specific parent states. For example, in a CodeReview state, the agent might have access to a set of verification actions:
Each action carries a distinct computational cost and information gain. A simple syntax check is cheap but offers low safety assurance; a security scan is expensive but high-value.
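A simple selection policy over such a library might look like the following sketch. The greedy gain-per-cost heuristic is a stand-in chosen for illustration, not a method proposed in this work, and the action names, costs, and information-gain values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VerificationAction:
    name: str
    cost: float       # relative compute cost (hypothetical units)
    info_gain: float  # assumed assurance gained if the check passes


def select_guards(actions: List[VerificationAction], budget: float) -> List[VerificationAction]:
    """Greedily pick the checks with the best gain-per-cost ratio within a budget."""
    chosen: List[VerificationAction] = []
    remaining = budget
    ranked = sorted(actions, key=lambda a: a.info_gain / a.cost, reverse=True)
    for action in ranked:
        if action.cost <= remaining:
            chosen.append(action)
            remaining -= action.cost
    return chosen


library = [
    VerificationAction("syntax_check", cost=1.0, info_gain=0.2),
    VerificationAction("type_check", cost=5.0, info_gain=0.5),
    VerificationAction("security_scan", cost=50.0, info_gain=0.9),
]
print([a.name for a in select_guards(library, budget=10.0)])
# ['syntax_check', 'type_check']
```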
8.4 Standardized Benchmarks for Probabilistic Control
Current code generation benchmarks (e.g., HumanEval, MBPP) primarily measure the static generative capability of models in a zero-shot regime. They do not capture the dynamic capabilities required for agentic workflows: error recovery, state maintenance across retries, and adherence to rigid environmental constraints.
The field requires a Control-Oriented Benchmark Suite—effectively a “GuardGym”—that evaluates agents not on their initial output, but on their ability to converge to a valid state under strict guard feedback. In this paradigm, the primary metrics shift from Pass@k to Refinement Efficiency (the mean number of retries required for convergence) and Trajectory Stability (the resistance to regression loops). Such a benchmark would isolate the architectural contribution of the control loop from the raw knowledge capacity of the model, providing a standardized method for evaluating neuro-symbolic bridges.
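The two proposed metrics could be computed along the following lines. The precise definitions are assumptions: Refinement Efficiency is taken as the mean retry count over converged episodes, and Trajectory Stability as one minus the fraction of attempts that resubmit an artifact already seen in the same episode. Both function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def refinement_efficiency(retries_per_converged_episode: Sequence[int]) -> float:
    """Mean number of retries needed among episodes that converged."""
    return mean(retries_per_converged_episode)


def trajectory_stability(attempts: Sequence[str]) -> float:
    """1.0 when no artifact is ever resubmitted; lower when the agent loops.

    Assumed definition: 1 - (repeated attempts / total attempts).
    """
    if not attempts:
        return 1.0
    repeats = len(attempts) - len(set(attempts))
    return 1.0 - repeats / len(attempts)


print(refinement_efficiency([0, 1, 2]))            # 1
print(trajectory_stability(["a", "b", "a", "b"]))  # 0.5
```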
8.5 Multi-Agent Shared Truth
In collaborative environments, the environment state is modified by multiple actors. The Dual-State framework provides synchronization without explicit message passing or complex consensus algorithms.
Proposition 5 (Shared Truth via Guards).
If two agents and execute the same deterministic Guard on the same shared artifact , they arrive at an identical belief regarding the workflow state component .
The workflow state thus serves as a fully observable blackboard. For example, a downstream Implementation Agent does not need to query an upstream Specification Agent for status; it simply executes the relevant (verify-spec) sensing action on the shared artifact. If the Guard passes, the shared truth is established, and execution proceeds.
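The blackboard behavior can be illustrated with two agents that never exchange messages. The `Agent` class and the `verify_spec` predicate (a toy rule requiring at least one `REQ-` line) are hypothetical and serve only to show that a deterministic guard yields identical beliefs on identical input.

```python
from typing import Callable, Dict


def verify_spec(artifact: str) -> bool:
    """Hypothetical deterministic guard: the spec must declare a requirement."""
    return any(line.startswith("REQ-") for line in artifact.splitlines())


class Agent:
    def __init__(self, name: str) -> None:
        self.name = name
        self.belief: Dict[str, bool] = {}  # local view of the workflow state

    def sense(self, guard_name: str, guard: Callable[[str], bool], artifact: str) -> bool:
        """Execute a sensing action and record the observed workflow state."""
        self.belief[guard_name] = guard(artifact)
        return self.belief[guard_name]


shared_artifact = "REQ-1: push adds an item\nREQ-2: pop removes the newest item"
spec_agent, impl_agent = Agent("specification"), Agent("implementation")
spec_agent.sense("verify-spec", verify_spec, shared_artifact)
impl_agent.sense("verify-spec", verify_spec, shared_artifact)
print(spec_agent.belief == impl_agent.belief)  # True
```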
8.6 Formal Workflow Specification
While this work uses JSON-based task specifications (Appendix B), the Dual-State architecture is compatible with richer formalisms. Future work may extend the specification language to support HTN-style hierarchical decomposition with explicit parallel fork-join semantics (e.g., :ordering (add || tdd || bdd)) and typed generative actions with retry bounds. Such extensions would enable formal verification of workflow properties (deadlock freedom, guaranteed termination) prior to execution.
9 Broader Impact
9.1 Safety as a Systemic Property
A prevailing view in AI alignment seeks to make the generative model itself “categorically safe” through Reinforcement Learning from Human Feedback (RLHF) or constitutional training. However, this work proceeds from the premise that the stochastic nature of Large Language Models is not a defect to be eliminated, but a fundamental capability—a “superpower” that enables creativity and solution diversity. Attempts to constrain this stochasticity at the model weights level risk lobotomizing the very capability we seek to exploit.
Instead, this framework advocates for shifting the locus of safety from the component (the LLM) to the system (the Architecture). By accepting that the solution space of a generative model is inherently probabilistic and unsafe, we can focus on augmenting it with a deterministic control layer that enforces safety constraints. In this view, safety is not an intrinsic attribute of the intelligence, but an emergent property of the workflow in which that intelligence is embedded.
9.2 Auditability
By formalizing the Environment State as a Versioned Repository (), the framework creates a record of rejected artifacts () and the feedback () that guided correction. This supports post-hoc analysis of failure modes and convergence behavior.
The append-only structure also provides a degree of tamper-evidence: unauthorized insertions into the context would create discontinuities in the derivation graph that could be flagged by external auditors.
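One common way to obtain this kind of tamper-evidence is a hash chain, sketched below. This is a swapped-in technique used for illustration; the framework's derivation graph is not specified at this level of detail, and the entry layout and function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder predecessor for the first entry


def _digest(body: Dict[str, str]) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: List[Dict[str, str]], artifact: str, feedback: str) -> None:
    """Append a rejected artifact and its feedback, linked to the previous entry."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"artifact": artifact, "feedback": feedback, "prev": prev}
    log.append({**body, "hash": _digest(body)})


def verify_chain(log: List[Dict[str, str]]) -> bool:
    """Return False if any entry was altered, inserted, or reordered."""
    prev = GENESIS
    for entry in log:
        body = {key: entry[key] for key in ("artifact", "feedback", "prev")}
        if entry["prev"] != prev or entry["hash"] != _digest(body):
            return False
        prev = entry["hash"]
    return True


audit_log: List[Dict[str, str]] = []
append_entry(audit_log, "def add(a, b): return a - b", "AssertionError: add(2, 3) != 5")
append_entry(audit_log, "def add(a, b): return a * b", "AssertionError: add(2, 3) != 5")
print(verify_chain(audit_log))  # True
```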
10 Conclusion
This paper formalizes a Dual-State Framework, an architecture that separates deterministic control flow from stochastic content generation in LLM-based systems. The central mechanism is the Atomic Action Pair, which couples generation with verification as an indivisible transaction. Guard functions serve not merely as filters, but as sensing actions that project opaque generative outputs onto an observable workflow state.
This enables Context Refinement, where guard feedback is incorporated into subsequent generation attempts. Through Guard Functions, Bounded Indeterminacy is achieved—the architecture does not eliminate the generator’s stochastic nature, but confines exploration within logical safety bounds. Additionally, because verification occurs immediately after each generation attempt, the architecture naturally produces immediate, attributable feedback—a property that may support future integration with reinforcement learning or fine-tuning approaches.
Experimental validation across 13 models indicates that the framework can substantially improve reliability for qualified instruction-following models, with observed gains of up to 66 percentage points. However, the results also highlight that guards cannot compensate for fundamental reasoning deficits; the model must possess sufficient capability to utilize verification feedback.
This work suggests a shift in how we view AI safety: not as an intrinsic property of the model weights to be trained in, but as a systemic property of the workflow to be architected. By treating the LLM’s stochasticity as a creative “superpower” rather than a defect, and wrapping it in the deterministic scaffolding of formal verification, we can build systems that are both imaginative and reliable.
As such, the framework’s primary contribution is not algorithmic novelty but a formal grounding: providing vocabulary, convergence conditions, and reasoning principles for architectural patterns already emerging in production systems—extending the tradition of deterministic frameworks for managing unpredictable processes to LLM-based code generation.
11 Acknowledgments
I would like to thank Mark Burgess (https://www.linkedin.com/in/markburgessoslo) and Ray Myers (https://www.linkedin.com/in/cadrlife/) for their guidance on formalization and for pointing me in sensible directions, and Joanna Bryson (https://www.linkedin.com/in/bryson/) for openly sharing her raw insights on AI ethics. I am also grateful to Professor Jeremy Scerri (https://www.linkedin.com/in/jeremy-scerri-b1b7b713/) and Jessica Sciammarelli (https://www.linkedin.com/in/jessicasciammarelli/) for their support and for opening critical pathways in my broader learning.
Special thanks to Lio (https://www.linkedin.com/in/lionelcrescence/) for providing the hardware resources, and to the Cohere Labs community (https://cohere.com/research/open-science), led by Madeline (https://www.linkedin.com/in/madeline-smith-3a0b8b155/), for providing a welcoming and energizing environment for this research.
I am also grateful to the software engineering community leaders who have championed the practices central to this work: Bryan Finster (https://www.linkedin.com/in/bryan-finster/), Tracy Bannon (https://www.linkedin.com/in/tracylbannon/), Patrick Debois (https://www.linkedin.com/in/patrickdebois/), Matthew Skelton (https://www.linkedin.com/in/matthewskelton/), and Rob Bowley (https://www.linkedin.com/in/robertbowley/).
Finally, thank you to Erwan Keraudy (https://www.linkedin.com/in/erwankeraudy/) and David Neil (https://www.linkedin.com/in/david-neil-44a67217b/) for being there at the start, to my business partner Hugo Miralles (https://www.linkedin.com/in/hugo-miralles/), and to my peers for their feedback and support: Oli, Natalia, Olgo, and Philip.
"There is one more thing, it’s been emotional."
References
- [1] M. Burgess, “A site configuration engine,” Computing Systems, vol. 8, no. 2, pp. 309–337, 1995, MIT Press: Cambridge, MA.
- [2] ——, “Computer immunology,” Proceedings of the 12th Systems Administration Conference (LISA ’98), pp. 283–297, 1998. [Online]. Available: https://markburgess.org/papers/immune.pdf
- [3] ——, “An approach to understanding policy based on autonomy and promises,” in Proceedings of the 16th IFIP/IEEE International Workshop on Distributed Systems: Operations and Management (DSOM). Springer, 2005, pp. 174–187.
- [4] ——, Promise Theory: Principles and Applications. Xtaxis Press, 2014.
- [5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.
- [6] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 4th ed. Pearson, 2020.
- [7] K. Erol, J. Hendler, and D. S. Nau, “HTN planning: Complexity and expressivity,” in Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI), vol. 94, 1994, pp. 1123–1128.
- [8] S. Kambhampati, K. Valmeekam, L. Guan, M. Verma, K. Stechly, S. Bhambri, L. Saldyt, and A. Murthy, “LLMs can’t plan, but can help planning in LLM-Modulo frameworks,” arXiv preprint arXiv:2402.01817, 2024. [Online]. Available: https://arxiv.org/abs/2402.01817
- [9] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” 2022.
- [10] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022.
- [11] Z. Shen, R. Gay, and X. Tao, “Goal-based intelligent agents,” International Journal of Information Technology, vol. 9, 2014.
- [12] M. C. Feathers, Working Effectively with Legacy Code. Prentice Hall, 2004, introduces characterization testing as a technique for understanding legacy systems.
Appendix A Structural Resolution of the Credit Assignment Problem
A fundamental challenge in training and orchestrating agentic systems with Reinforcement Learning is the Credit Assignment Problem (CAP): determining which past action is responsible for the current reward. In standard Chain-of-Thought (CoT) or monolithic generation approaches, the “Time Horizon” between the error (e.g., a hallucinated variable at step ) and the penalty (e.g., a compilation error at step ) is large. This makes gradient attribution—or in-context correction—noisy and inefficient, as the agent must infer which specific token in the history caused the failure.
The Atomic Action Pair can serve as a structural solution to the Temporal CAP, effectively converting a sparse-reward search problem into a dense-reward optimization loop.
A.1 Collapsing the Reward Horizon
Standard autoregressive agents operate with a delayed reward horizon (), where validation occurs only after the complete generation of a complex artifact. By enforcing that every generative action () is immediately followed by a sensing action (), the Dual-State architecture collapses the reward horizon to .
Consequently, the “discount factor” for error correction approaches zero (). Every state transition in the Workflow State is gated by an immediate validity signal, ensuring that “blame” for failure is instantly and correctly assigned to the most recent generative attempt. This may explain the efficiency observed in the experiments ( cost), as the agent never expends compute continuing a trajectory that has already diverged from validity.
A.2 The Coach as a Reward Shaping Mechanism
While the Guard () provides a ground-truth signal, it is inherently sparse (). A sparse signal informs the agent that it failed, but not how to correct the error, potentially leading to random walk behavior during refinement.
The Coach (), defined in Section 8.2.1, addresses this by introducing Reward Shaping. It effectively approximates a value function by analyzing the Guard’s error trace () and projecting a dense semantic signal back into the Context ():
| (10) |
Appendix B Workflow Specification Format
Workflows are specified in a declarative JSON format inspired by PDDL (Planning Domain Definition Language). Each workflow defines a directed acyclic graph (DAG) of action pairs, where edges represent artifact dependencies.
B.1 Schema
B.2 Guard Types
• syntax: AST parsing validation (G8)
• dynamic_test: Runtime test execution (G10)
• type: Static type checking via mypy (G9)
• architecture: Layer boundary validation (G11)
B.3 Example: TDD Stack Task
The requires field creates an artifact dependency: g_impl receives the validated output of g_test via the {test_code} placeholder. This enables Test-Driven Development workflows where tests are generated first, then used to validate implementations.
Appendix C Guard Function Catalog
This appendix enumerates the deterministic guard functions () that enforce correctness constraints. Each guard validates a specific state transition.
Remark 7 (Human Oversight for Semantic Guards).
Certain guards validate artifacts that are semantically complex—where correctness cannot be verified by deterministic checks alone. These include:
• Domain models (G1): Whether entities and invariants correctly capture business requirements
• Generated test specifications (G4, G6): Whether BDD scenarios or architecture tests capture intended constraints
When the LLM generates artifacts that themselves become validators (guards generating guards), a HumanGuard checkpoint is essential. Without human meta-validation, errors in generated specifications propagate silently through the entire validation chain.
Guards marked with indicate recommended HumanGuard integration points.
Phase Overview
The guard catalog organizes 29 guards across 9 workflow phases, plus 6 bootstrapping guards for legacy systems (Appendix F):
| Phase | Name | Guards | Primary Concern |
|---|---|---|---|
| 1 | Architecture Definition | G1–G4 | Domain model & structure |
| 2 | Test Definition | G5–G6 | Unit tests & BDD scenarios |
| 3 | Implementation | G7–G10 | Code generation & validation |
| 4 | Architectural Compliance | G11–G13 | Layer boundaries & DI |
| 5 | Behavioral Validation | G14–G15 | Acceptance & quality gates |
| 6 | Operational Safety | G16–G17 | Execution & file safety |
| 7 | Structure Audit | G18–G19 | Documentation sync |
| 8 | Version Control | G23 | Pre-commit validation |
| 9 | Human Oversight | G20 | Final approval checkpoint |
| – | Composite | G21–G22 | Guard composition patterns |
| – | Bootstrap (Legacy) | G24–G29 | Brownfield system support |
Phases 1–2 establish what to build, phases 3–5 ensure correctness, and phases 6–8 enforce safety.
Phase 1: Architecture Definition
Phase 2: Test Definition
Phase 3: Implementation
Phase 4: Architectural Compliance
Phase 5: Behavioral Validation
Phase 6: Operational Safety
Phase 7: Structure Audit
Phase 8: Version Control Safety
Phase 9: Human Oversight
Composite Guards
Appendix D Guard Library Implementation
Reference implementations in Python. All guards implement the GuardInterface:
D.1 SyntaxGuard (G8)
D.2 DynamicTestGuard (G10)
Executes generated tests against generated implementation:
D.3 ArchitectureBoundaryGuard (G11)
Validates Clean Architecture dependency rule:
D.4 HumanGuard (G20)
Pauses workflow for human approval:
D.5 CompositeGuard (G21)
Combines multiple guards with AND semantics:
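A minimal sketch of AND composition follows. It does not reproduce the reference implementation: the guard signature (a callable returning a pass flag and feedback), the short-circuit on first failure, and both example guards are assumptions made for illustration.

```python
import ast
from typing import Callable, List, Tuple

# Hypothetical guard signature: artifact -> (passed, feedback).
GuardFn = Callable[[str], Tuple[bool, str]]


def composite_guard(guards: List[GuardFn]) -> GuardFn:
    """Pass only if every guard passes; report the first failure's feedback."""

    def run(artifact: str) -> Tuple[bool, str]:
        for guard in guards:
            ok, feedback = guard(artifact)
            if not ok:
                return False, feedback  # short-circuit: later guards are skipped
        return True, ""

    return run


def parses(artifact: str) -> Tuple[bool, str]:
    """Example guard: the artifact must be valid Python syntax."""
    try:
        ast.parse(artifact)
        return True, ""
    except SyntaxError as exc:
        return False, f"SyntaxError: {exc}"


def defines_function(artifact: str) -> Tuple[bool, str]:
    """Example guard: the artifact must define at least one function."""
    has_def = any(isinstance(node, ast.FunctionDef) for node in ast.walk(ast.parse(artifact)))
    return (True, "") if has_def else (False, "no function definition found")


check = composite_guard([parses, defines_function])
print(check("def f():\n    return 1"))  # (True, '')
```

Ordering cheap guards first means an expensive guard only runs on artifacts that have already cleared the inexpensive checks.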
D.6 PreCommitGuard (G23)
Validates staged changes before allowing a commit:
D.7 ParallelGuard
Independent guards can execute concurrently:
D.8 Bootstrap Guards
The following guards support bootstrapping legacy systems (see Appendix F).
Appendix E TDD Workflow Execution Trace
This appendix illustrates a complete TDD workflow execution for the Stack task.
E.1 Workflow DAG
E.2 Step 1: Test Generation (g_test)
Prompt:
Generation Attempt 1:
Guard (SyntaxGuard): (AST parses successfully)
State Transition: g_test VALIDATED
E.3 Step 2: Implementation Generation (g_impl)
Prompt (with artifact injection):
Generation Attempt 1:
Guard (DynamicTestGuard):
Context Refinement:
Generation Attempt 2:
Guard (DynamicTestGuard): (all tests pass)
State Transition: g_impl VALIDATED
E.4 Execution Summary
| Step | Attempts | Guard Result | Duration |
|---|---|---|---|
| g_test | 1 | Pass | 2.3s |
| g_impl | 2 | Pass | 4.1s |
Total retries: 1
Total duration: 6.4s
Workflow state: COMPLETE
Appendix F Bootstrapping Legacy Systems
The framework as presented assumes workflows begin with explicit specifications. Legacy (“brownfield”) systems present a practical challenge: the specification exists only implicitly in running code. There are no tests to validate against, no documented invariants to enforce.
This appendix sketches how the framework might be extended to support bootstrapping—generating the missing validation infrastructure from existing codebases. The approach is not novel; it formalizes characterization testing practices that predate this work [12].
F.1 The Initialization Problem
In greenfield systems, guards have well-defined pass/fail semantics from the start. In brownfield systems, we face a chicken-and-egg problem: we cannot validate code without tests, but we cannot write tests without understanding the code’s actual behavior.
Formally, for a legacy artifact and guard , the initial state is undefined—we lack the predicate to evaluate. The bootstrapping problem is to construct that predicate by observing the system’s behavior.
F.2 Characterization Testing as Guard Generation
The standard TDD relationship (code must satisfy tests) reverses during bootstrapping: tests must satisfy code. The legacy system becomes the oracle.
| Approach | Oracle | Artifact Under Test |
|---|---|---|
| Standard TDD | Tests | Code must satisfy tests |
| Bootstrapping | Code | Tests must satisfy code |
This has a practical consequence: test failure during bootstrapping indicates a bug in the test, not the system under test. The guard accepts tests only when they pass against unmodified legacy code—including behavior that might be considered bugs in a greenfield context but are now load-bearing “features.”
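The inverted oracle relationship can be sketched as a guard that accepts candidate tests only when they agree with the unmodified legacy code. The legacy function, its quirk, and the guard name are all hypothetical; candidate tests are represented as `(arguments, expected result)` pairs for brevity.

```python
from typing import Any, Callable, List, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (arguments, expected result)


def legacy_discount(quantity: int) -> int:
    """Hypothetical legacy code with a quirk: negative quantities get a discount."""
    if quantity >= 10 or quantity < 0:
        return 10
    return 0


def characterization_guard(
    candidate_tests: List[TestCase],
    legacy_fn: Callable[..., Any],
) -> Tuple[bool, List[str]]:
    """Accept generated tests only if the legacy code (the oracle) satisfies them.

    A mismatch is reported as a defect in the test, not in the legacy code.
    """
    rejected = []
    for args, expected in candidate_tests:
        actual = legacy_fn(*args)
        if actual != expected:
            rejected.append(f"test{args} expected {expected!r}, legacy returns {actual!r}")
    return (not rejected, rejected)


# The third test encodes what the code "should" do and is therefore rejected.
generated = [((5,), 0), ((10,), 10), ((-1,), 0)]
passed, feedback = characterization_guard(generated, legacy_discount)
print(passed, feedback)
```

After regeneration with the feedback, the corrected test `((-1,), 10)` pins down the quirk as observed behavior.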
F.3 A Three-Phase Pipeline
The bootstrapping process can be decomposed into three phases. These are not novel contributions—they reflect standard practice in legacy system modernization—but expressing them as guard predicates allows integration with the framework.
F.3.1 Phase I: Structural Audit
The first phase establishes what can be analyzed without execution.
Output: Dependency graph, module boundaries, entry points—the structural map needed for targeted characterization.
F.3.2 Phase II: Characterization Testing
Characterization tests capture what the system does, not what it should do. The legacy code is the oracle.
Remark 8 (Untestable Code).
Code that resists characterization testing often indicates dead code, error handlers for impossible conditions, or race conditions. These require manual analysis via HumanGuard (G20) rather than automated characterization.
F.3.3 Phase III: Constraint Promotion
Characterization tests become constraints. The system transitions from “undefined state” to “guarded state.”
Once G29 passes, the legacy system has bootstrapped into a standard guarded workflow—future changes must satisfy the characterization tests. The system transitions from “we don’t know what this code does” to “we have tests that document what it does, and changes must preserve that behavior.”
This is not a guarantee of correctness in any absolute sense. The characterization tests capture observed behavior, which may include bugs. The value is that changes are now guarded—regressions become detectable.