AI‑Spec‑Driven Development with DevLoop: a practical working model
Introduction
Most teams trying to “use AI in development” bolt a code generator onto a traditional process and hope for speed. They usually get fragments: faster stubs, but the same misunderstandings, drift, and rework. Here agents.md (https://agents.md/) is a definite improvement: it is a solid public standard for AI code generation, supported by a growing number of coding agents (OpenAI’s Codex, GitHub Copilot, etc.). Its purpose is to give coding agents a consistent way to read intent and emit code and tests. On its own, though, it improves local throughput (scaffolds, stubs, small features) without fixing the end-to-end bottlenecks of the Software Delivery Life Cycle. To get the systemic gains teams expect, you need AI-Spec-Driven Development with the DevLoop: a project "way of working" that provides a "rhythm" connecting the spec to generation and to proof, every day, across the Software Delivery Life Cycle.
In this approach, the Executable Product Spec (XPS) is written as agents.md, but its role is elevated: it’s the single, living contract for humans, agents, and CI. Humans write intent and user-visible behavior in plain language; agents consume the same file to generate code, tests, and scaffolding; CI interprets a few spec-level claims (oracles) and returns evidence. Work advances in one tight loop—Specify → Generate → Prove → Refine—applied at every level (product, feature, module). When evidence disagrees with the spec, you either update the spec (intent changed) or the code (implementation wrong) and you loop again. This is what turns agents.md from a code-gen convenience into a delivery mechanism that raises quality and speed across the whole lifecycle.
Two properties make it practical. First, the spec is human-first: it opens with purpose and user-visible behavior so reviewers get clarity before they see wiring. Second, it is agent-readable in the same place: it names inputs/outputs, oracles for CI, and minimal architecture/layout rules, so generators can scaffold safely and pipelines can assert facts (e.g., “artifact exists and is non-empty; exit code is zero”). There’s no parallel source of truth to reconcile; the XPS is the product’s heartbeat.
A working tour of the app (before the XPS)
To make the DevLoop concrete, we’ll use a real, running module: text2audio. It converts a short Markdown/Text script into a spoken audio file, optionally translates first, and streams synthesis directly to disk. Repository: https://github.com/soyrochus/text2audio/
From a user’s point of view, the app is a small, predictable CLI. You point it at a text file and an output path, choose a voice/model/format, and it produces an audio artifact while showing progress. When it finishes, it prints the absolute path to the file and a one-line summary (format, voice). If something is off—missing key, invalid voice/model—the command stops with a clear message and a non-zero exit code. That’s it: simple to run, easy to reason about, and testable.
Minimal run (happy path)
Create a tiny prompt file:
printf "Hello, this is text2audio.\n" > examples/hello.md
Run the tool:
export OPENAI_API_KEY=sk-...redacted...
uv sync
uv run python -m text2audio \
--prompt-file examples/hello.md \
--audio-file out/hello.mp3 \
--audio-format mp3 \
--language english \
--voice alloy \
--tts-model tts-1-hd
What you see while it runs is intentionally boring and informative; the last lines include something like:
AudioGenerated: out/hello.mp3 (mp3, voice=alloy, model=tts-1-hd, lang=en, ~2.1s)
DONE out/hello.mp3
There is no giant in-memory buffer; audio bytes are streamed to disk. A quick test -s out/hello.mp3 (or ffprobe) confirms the file exists and isn’t empty.
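For example, a minimal post-run check (ffprobe is optional; the second command assumes it is installed) could look like this:
test -s out/hello.mp3 && echo "artifact exists and is non-empty"
ffprobe -v error -show_entries format=duration -of default=nw=1 out/hello.mp3
# prints something like duration=2.112000 when the file is a readable audio container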
Two small variations (behavior, not ceremony)
Skip translation and keep the source language:
uv run python -m text2audio \
--prompt-file examples/hello.md \
--audio-file out/hello_native.mp3 \
--no-translate \
--voice alloy --tts-model tts-1-hd
Model-specific instructions (only for models that support them, e.g., gpt-4o-mini-tts):
uv run python -m text2audio \
--prompt-file examples/hello.md \
--audio-file out/hello_instruct.mp3 \
--tts-model gpt-4o-mini-tts \
--voice nova \
--instructions "warm, friendly narrator"
When a model ignores --instructions by design, the run still succeeds—just without applying the extra guidance. That distinction is deliberate and verifiable later in CI.
Utilities you actually use
You don’t guess voice names; you ask the app:
uv run python -m text2audio --list-voices
And when you want to check which voices really work in your environment, you probe them with ~1-second clips and get a crisp summary:
uv run python -m text2audio --probe-voices --tts-model tts-1-hd --audio-format mp3
# ...
# VoiceProbeCompleted: working=[alloy, verse, ...], failed=[...]
Both utilities exit cleanly after printing—useful in local dev and in CI diagnostics.
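As a rough sketch, a CI diagnostics step can simply assert the clean exit:
uv run python -m text2audio --list-voices; echo "exit=$?"
# expect exit=0; the same applies to --probe-voices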
Negative path (by design)
If the API key is missing, the command fails fast, prints a clear explanation (with the key redacted, never echoed), creates no file, and returns a specific exit code. That behavior is intentional, repeatable, and checked later by CI.
unset OPENAI_API_KEY
uv run python -m text2audio \
--prompt-file examples/hello.md \
--audio-file out/should_not_exist.mp3 || echo "exit=$?"
# -> error: OPENAI_API_KEY missing
# -> exit=2
What’s happening under the hood (only what you need to know now)
The CLI layer handles arguments, environment checks, and dispatch. A small “model” layer coordinates optional translation and TTS; it streams bytes to disk and reports progress. “Views” are just the textual progress and summaries you saw above. The app emits a few named events (e.g., AudioGenerated) so logs and dashboards speak the same language as the spec. Secrets are redacted centrally. Re-running with the same inputs is safe and simply overwrites the target file.
Why this narrative matters to the DevLoop
Everything you just did is straightforward to Specify in a short agents.md (the XPS), trivial to Generate with an agent (argument parsing, guards, progress output), and easy to Prove in CI (happy path, missing-key, model-semantics). When behavior changes, you adjust the spec—often a single sentence—regenerate, and rerun. No ceremony; just incremental edits that keep code, tests, and expectations aligned. The full XPS for this module captures exactly that, section by section.
The model in practice
Think of the XPS as the contract of reality. It is short, human-first, and specific enough for agents and CI to act on. The order of sections matters because it lines up with how people think and how machines build:
Purpose & Promise
User-visible Behavior
Inputs & Outputs at a glance
Oracles
Quality, Safety & Policy
Domain Snapshot
Architecture & Layout
How to Run
Boundaries & Non-goals
This removes drift. There’s one page where decisions live and one loop that keeps them honest.
A Deep Dive into the Executable Product Specification as an agents.md for text2audio: a practical example
Using the reference application at https://github.com/soyrochus/text2audio/, we will build up the XPS in order, with the real module text2audio as our running example. The purpose of the example is not to exhaust every detail, but to show how each section earns its keep and how you evolve it incrementally.
Purpose & Promise — say what it does and why it matters
You open agents.md by explaining the transformation and the value, in a few lines. This anchors everyone before any code is written.
text2audio turns a short Markdown/Text script into a spoken audio file you can publish. It can translate to a target language, then synthesize with an OpenAI TTS model, streaming to disk to avoid large memory buffers. It shows clear progress, offers voice utilities, and can play the result locally. Outputs include mp3, wav, opus, and aac.
Why this section exists: it aligns expectations in seconds and already encodes a non-functional promise (streaming).
How it evolves: if you later decide to chunk long inputs by headings, you add one sentence here. That single sentence drives the next loop of code and CI.
User-visible Behavior — describe views and actions literally
Keep it in natural language: what screens/commands exist, what they accept, and what must be true before and after.
Run accepts: prompt_file, audio_file, audio_format, language, tts_model, voice, speed, optional instructions, optional no_translate. It runs only if OPENAI_API_KEY is set and prompt_file exists. On success, it prints the final audio_file path and emits AudioGenerated { audio_file, format, language, tts_model, voice, ~duration_s }. If a voice/model is rejected, keep the view and print a hint. Utilities: VoiceList (list voices), VoiceProbe (short per-voice clips and a summary).
Why this section exists: it doubles as acceptance criteria and a contract for scaffolding.
How it evolves: adding --no-translate is one line here. The agent updates argument parsing and branching; CI gains a semantic check.
Inputs & Outputs at a glance — parameters + tangible result
A tiny table is enough for agents to wire parsing and for CI to assert artifacts.
Outcome: audio_file exists and is non-empty. Evidence: an AudioGenerated { ... } event in logs/telemetry.
Why this section exists: parameters stop being vague; the outcome becomes checkable.
How it evolves: need loudness normalization? Add normalize_lufs and extend “Outcome” with a LUFS range. One line in the spec; one more CI assertion.
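As a sketch of what the current outcome check can look like in CI (capturing stdout to a log file is an assumption; the event name follows the spec above):
uv run python -m text2audio --prompt-file examples/hello.md --audio-file out/hello.mp3 | tee run.log
test -s out/hello.mp3 || exit 1             # outcome: artifact exists and is non-empty
grep -q "AudioGenerated" run.log || exit 1  # evidence: the named event appears in the output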
Oracles — what CI must prove this iteration
Two to four claims. If they hold, the module is “good” for now.
Why this section exists: oracles are how the spec becomes executable.
How it evolves: when you add normalize_lufs, add S2: “Given --normalize-lufs -16, measured loudness is −16 LUFS ±1.” CI grows a new job; the model adds a normalization step.
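A rough sketch of that future S2 job, assuming ffmpeg is available in CI and using the hypothetical --normalize-lufs flag from the evolution above:
uv run python -m text2audio --prompt-file examples/hello.md --audio-file out/hello_norm.mp3 --normalize-lufs -16
ffmpeg -hide_banner -nostats -i out/hello_norm.mp3 -af ebur128 -f null - 2>&1 | grep "I:"
# CI parses the reported integrated loudness and asserts it lies within -16 ±1 LUFS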
Quality, Safety & Policy — name the rules before wiring
Security and operability shouldn’t be inferred. Write them down and let tooling enforce them.
Stream to disk; never buffer full audio in memory.
Never print OPENAI_API_KEY (redact on error).
Pragmatic idempotency: same inputs may overwrite the same audio_file.
Exit codes: MISSING_API_KEY=2, FILE_NOT_FOUND=3, VOICE_OR_MODEL_REJECTED=4, unexpected=1.
Logging: concise INFO, timings at DEBUG, no secrets.
Accessibility: always print the final audio_file path and a one-line summary.
Why this section exists: it lets generators synthesize guard code and lets reviews be objective.
How it evolves: forbid network calls in views? Add a line; import-graph checks can fail non-compliant PRs.
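A minimal sketch of a CI guard for the negative-path policies above (sending both output streams to one log file, and the sk- prefix check, are assumptions):
unset OPENAI_API_KEY
uv run python -m text2audio --prompt-file examples/hello.md --audio-file out/x.mp3 > run.log 2>&1
code=$?
[ "$code" -eq 2 ] || { echo "expected MISSING_API_KEY=2, got $code"; exit 1; }
[ ! -e out/x.mp3 ] || { echo "no artifact should be created on failure"; exit 1; }
grep -q "sk-" run.log && { echo "a key-like string leaked into the output"; exit 1; }
echo "negative-path guard passed"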
Domain Snapshot — stabilize names for artifacts and events
This keeps your vocabulary consistent across code, logs, and dashboards.
Why this section exists: names become reusable handles in telemetry and tests.
How it evolves: new utilities introduce new events; you append them here first.
Architecture & Layout — boundaries and import rules
Give the minimum structure agents and reviewers need; keep it enforceable.
cli (entrypoint, args, env; no business logic)
model (translation + TTS orchestration; streams to disk; functions as pure as practical)
views (progress/tables/summaries; call the model, not the APIs)
Allowed imports: cli→model, cli→views, views→model. Forbidden: model importing cli or views.
Testing: unit for pure functions; short end-to-end; golden tests for views.
Why this section exists: it prevents spaghetti and enables safe scaffolding.
How it evolves: if you add events.py/errors.py, add the rule “helpers import nothing app-specific” and enforce it with a simple import-graph check.
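One way to sketch that import-graph check with plain grep (the src/text2audio/ layout and module file names are assumptions):
# fail the build if the model layer reaches into cli or views
if grep -RInE "^(from|import) +text2audio\.(cli|views)" src/text2audio/model*; then
  echo "forbidden import: model must not depend on cli or views"
  exit 1
fi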
How to Run — one happy path for humans and CI
A newcomer should be able to copy-paste and get a real artifact; CI should do the same.
uv sync
export OPENAI_API_KEY=sk-...redacted...
uv run python -m text2audio \
--prompt-file examples/description-text2audio.md \
--audio-file out/description-text2audio.mp3 \
--audio-format mp3 \
--language english \
--voice alloy \
--tts-model tts-1-hd
Utilities stay discoverable but optional:
uv run python -m text2audio --list-voices
uv run python -m text2audio --probe-voices --tts-model tts-1-hd --audio-format mp3
Why this section exists: it shortens onboarding and removes guesswork from CI.
How it evolves: extend it only when a new capability becomes part of the happy path.
Boundaries & Non-goals — deliberate scope cuts
State what is out of scope for this iteration so decisions are visible and test plans stay lean.
No DB or cache. No .srt subtitles. No segmentation by headings (single output file per run). No Windows playback guarantee.
Why this section exists: it prevents accidental scope creep and misfiled “bugs”.
How it evolves: when a boundary moves, change it here first, then add/adjust oracles and regenerate code/tests.
What changes when you work this way
This is the difference between “using AI to write code” and AI-Spec-Driven Development with DevLoop. The first accelerates typing. The second turns a short, human-first specification into a delivery engine that scales across the lifecycle.
Again, the example module is here (working code): https://github.com/soyrochus/text2audio/ and the file convention is documented here: https://agents.md/