Why Your AI Coding Agent Keeps Going Off-Script — And How to Fix It
How spec-driven development turns unpredictable AI agents into reliable software factories through deterministic orchestration, bounded execution, and automated evaluation.
The Gap Nobody Talks About
Every team we work with has the same story. They gave developers Copilot or Cursor. Completion times dropped. Individual productivity went up. Then someone asked: *"So why aren't we shipping features faster?"*
The answer is deceptively simple. AI coding assistants make the coding faster, but coding was never the bottleneck. The bottleneck is everything between "the business wants X" and "X is live in production" — the handoffs, the context loss, the decisions made in Slack threads that nobody can find six months later.
We've spent the past year building agentic workflows for clients across fintech, proptech, and enterprise SaaS. The pattern that works looks nothing like "give an AI agent a task and let it figure things out." It looks more like a factory floor: structured, sequential, and deliberately boring in its orchestration — with AI doing the creative heavy lifting inside tightly bounded stages.
This post breaks down what we've learned.
The Core Problem: Agents Can't Self-Manage
Here's what happens when you let an AI agent orchestrate its own workflow on a non-trivial codebase:
- It skips steps. Requirements analysis? The agent jumps straight to code.
- It creates circular dependencies. Task A needs Task B, which needs Task A.
- It loses context mid-flight. By the time the agent is writing its fifth file, it's forgotten architectural decisions it made on the first.
- It produces inconsistent results. Two developers prompting the same model get structurally different outputs. You can't QA what you can't predict.
On a weekend project, this is fine. On a production system with compliance requirements, multiple teams, and a codebase that's been evolving for three years, it's a non-starter.
The insight that changed our approach: agents are excellent executors but terrible managers. They generate high-quality content within a well-defined problem space. They fail at meta-level decisions — what to do next, when work is "done enough," how different workstreams interact.
Figure 1: The Problem Spectrum
The Two-Layer Architecture
The pattern that works consistently separates deterministic orchestration from bounded agent execution. Two layers, two different jobs, never mixed.
Layer 1: Orchestration (no AI, no judgment calls)
The orchestration layer is a conventional workflow engine — think state machine, not neural network. It does exactly four things:
1. Enforces phase order. Requirements -> Architecture -> Tasks -> Implementation. No skipping.
2. Manages dependencies. Task B can't start until Task A is complete.
3. Tracks artifact state. Every document carries a status (`draft -> in-review -> approved -> complete`) in machine-readable metadata. The engine reads this, not vibes.
4. Triggers the right agent at the right time. "When REQ-001 is approved, generate technical tasks" is a rule. "Figure out what to do next" is a prayer.
The orchestration runs *around* the agents, not *through* them. Agents never decide what phase they're in or what comes next. They receive a bounded task, produce an artifact, and hand control back to the engine.
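To make the split concrete, here is a minimal sketch of what the orchestration layer could look like. It is a plain state machine with no AI in it: phase order and status transitions are hard-coded data, and the class names and method shapes are illustrative, not from any specific tool.

```python
from dataclasses import dataclass, field

# Phase order and status lifecycle are fixed data, not model output.
PHASES = ["requirements", "architecture", "tasks", "implementation"]
STATUSES = ["draft", "in-review", "approved", "complete"]

@dataclass
class Artifact:
    artifact_id: str
    phase: str
    status: str = "draft"

@dataclass
class Orchestrator:
    artifacts: dict = field(default_factory=dict)

    def add(self, artifact: Artifact) -> None:
        # Enforce phase order: a phase only opens once every artifact
        # in all earlier phases is complete. No skipping.
        idx = PHASES.index(artifact.phase)
        for prior in PHASES[:idx]:
            pending = [a for a in self.artifacts.values()
                       if a.phase == prior and a.status != "complete"]
            if pending:
                raise RuntimeError(
                    f"{prior} not complete; cannot start {artifact.phase}")
        self.artifacts[artifact.artifact_id] = artifact

    def advance(self, artifact_id: str) -> str:
        # Move one step along draft -> in-review -> approved -> complete.
        a = self.artifacts[artifact_id]
        a.status = STATUSES[STATUSES.index(a.status) + 1]
        return a.status
```

The point of the sketch: agents never call `add` or `advance` themselves. The engine invokes an agent when a rule fires, and the agent's only output is an artifact.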
Layer 2: Execution (AI does the creative work)
Within each phase, specialised agents do what they're good at:
- A requirements agent analyses business intent and produces structured specs
- An architecture agent proposes technical design within established conventions
- A coding agent implements against a precise task specification
- A knowledge agent answers queries from the other agents about project context, conventions, and prior decisions
Each agent is essentially a skill — a bounded set of instructions, templates, and evaluation criteria for a specific type of work. Like microservices, we trade one complex general-purpose agent for multiple simpler ones with clear interfaces between them.
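The "clear interfaces" claim can be sketched as a shared contract: every agent, whatever model backs it, takes a bounded task plus context and returns an artifact. The names below are hypothetical, and the stub stands in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    task_id: str
    instructions: str

@dataclass
class ArtifactDraft:
    path: str
    content: str

class SkillAgent(Protocol):
    """The one interface the orchestrator depends on."""
    name: str
    def run(self, task: Task, context: dict) -> ArtifactDraft: ...

class RequirementsAgent:
    name = "requirements"

    def run(self, task: Task, context: dict) -> ArtifactDraft:
        # A real implementation would prompt a model with the skill's
        # templates and evaluation criteria; this stub only shows the
        # shape of the exchange.
        body = f"# {task.task_id}\n\n{task.instructions}"
        return ArtifactDraft(path=f"specs/{task.task_id}/requirement.md",
                             content=body)
```

Because every agent satisfies the same protocol, swapping the model behind one skill never touches the orchestration layer, which is exactly the microservices trade described above.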
Figure 2: Two-Layer Architecture
The Review Gate: How Quality Stays Consistent
Every agent output passes through a two-stage review gate before the workflow advances.
Stage 1 — Deterministic checks (fast, binary):
- Required metadata present? (IDs, status, parent references)
- All mandatory sections filled? (Description, Acceptance Criteria, Dependencies)
- Cross-references valid? (Does TASK-001 actually reference an existing REQ?)
- For code: linters pass, tests pass, coverage thresholds met, architectural boundaries respected
Stage 2 — Critic agent (slower, judgment-based):
- •"Are these acceptance criteria actually testable?"
- •"Does this architecture follow the project's established patterns?"
- •"Does the implementation match the task spec?"
If either stage rejects the output, the producing agent iterates — still within the same phase — until it passes both. We cap retries at 3-5 attempts. If the agent can't pass the gate, the workflow stops and escalates to a human.
A note on honesty here: the review gate catches structural problems and obvious consistency failures reliably. It does *not* replace human architectural judgment. Subtle business logic errors, cross-cutting security concerns, and "this technically works but is the wrong abstraction" decisions still need human eyes. The gate's job is to raise the floor — ensuring that what reaches a human reviewer has already passed a baseline quality threshold — not to eliminate review entirely. In regulated domains especially, the human review at PR stage isn't optional overhead; it's where the highest-value judgment happens.
The critical design choice: evaluations run on the output artifacts (files), not on conversational responses. The agent can ramble all it wants in its reasoning; what matters is whether the markdown file, the code, and the tests meet the definition of done.
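Putting the two stages and the retry cap together, the gate's control flow could look like the sketch below. The producer and critic are stand-ins for real agents; the point is that the engine, not the agent, decides when to retry and when to escalate.

```python
MAX_RETRIES = 3  # the post suggests capping at 3-5 attempts

def review_gate(produce, deterministic_checks, critic, max_retries=MAX_RETRIES):
    """Run produce() until both stages pass, or escalate to a human."""
    feedback = None
    for attempt in range(1, max_retries + 1):
        artifact = produce(feedback)
        errors = deterministic_checks(artifact)      # Stage 1: fast, binary
        if errors:
            feedback = f"fix: {errors}"
            continue
        verdict = critic(artifact)                   # Stage 2: judgment-based
        if verdict == "pass":
            return {"status": "approved", "artifact": artifact,
                    "attempts": attempt}
        feedback = verdict                           # critic's critique feeds back
    return {"status": "escalated", "attempts": max_retries}
```

Note that `produce` receives the previous rejection as feedback, so each retry is an informed iteration within the same phase rather than a blind re-roll.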
The File System Is the Workflow Engine
Here's where it gets practical. The folder structure isn't organisational preference — it's the workflow engine's API.
```
.sdlc/
  context/                    # Persistent — applies to all features
    project-overview.md       # System scope, tech stack, constraints
    architecture.md           # Architectural decisions and patterns
    conventions.md            # Naming, structure, coding standards
  templates/                  # Reusable artifact templates
    requirement-template.md
    task-template.md
  specs/                      # Per-feature — one folder per requirement
    REQ-001-notification-system/
      requirement.md
      tasks/
        TASK-001-notification-service.md
        TASK-002-email-channel.md
        TASK-003-push-channel.md
  knowledge/                  # Growing knowledge base
    assumptions.md            # Logged by agents, reviewed in PRs
    decisions.md              # Approved architectural decisions
src/                          # Application source code
tests/                        # Test suites
AGENTS.md                     # Root-level agent instructions
```

How the Structure Drives Automation
Every convention serves the automation:
| Convention | What it tells the engine |
|---|---|
| `context/` vs `specs/` | Persistent project knowledge vs feature-specific work |
| `REQ-001-*/` folder naming | Parent-child relationships between requirements and tasks |
| `TASK-*` prefix inside `tasks/` | Dependency graph parsing and execution ordering |
| Frontmatter `status: draft` | Which artifacts are ready, blocked, or complete |
| `parent: REQ-001` in task metadata | Traceability chain from code back to business requirement |
The workflow engine reads this structure like an API. It doesn't need a database, a project management tool, or a separate tracking system. The repo is the single source of truth — for the code, the specs, the decisions, and the workflow state.
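As a sketch of "reading the structure like an API", the engine could derive task execution order purely from frontmatter, with the standard library doing the graph work. The `depends_on` field name is an assumption; the post describes dependency parsing from `TASK-*` files but not the exact key.

```python
import re
from graphlib import TopologicalSorter

def parse_frontmatter(text: str) -> dict:
    """Extract key: value pairs from a --- fenced frontmatter block."""
    meta = {}
    m = re.match(r"---\n(.*?)\n---", text, re.DOTALL)
    if m:
        for line in m.group(1).splitlines():
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def execution_order(task_files: dict) -> list:
    """task_files maps path -> file text; returns task ids in runnable order."""
    graph = {}
    for text in task_files.values():
        meta = parse_frontmatter(text)
        deps = [d.strip() for d in meta.get("depends_on", "").split(",")
                if d.strip()]
        graph[meta["id"]] = deps  # TopologicalSorter wants predecessors
    return list(TopologicalSorter(graph).static_order())
```

No database, no tracker: the dependency graph lives in the same files a human reviews in the PR, and the engine recomputes it on every run.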
Figure 3: The Traceability Chain
The Knowledge Feedback Loop
One of the most underappreciated problems with agentic development is what happens when the agent doesn't know something.
A requirements agent is breaking down a notification feature. It needs to know: *"Should notifications be synchronous or queued?"* Three things can happen:
1. The answer exists in the project knowledge base — the knowledge agent returns it, and the requirements agent proceeds with a grounded decision.
2. The answer doesn't exist — the knowledge agent logs the question and the assumption the agent will make (e.g., "Assuming async queue-based delivery"). The assumption appears as structured data in the pull request.
3. The question is ambiguous — it's flagged for human review alongside the PR.
Here's why this matters: when a reviewer approves the PR, they're approving the assumptions alongside the code. When they reject an assumption, the correction feeds back into the knowledge base as a documented decision. Over time, agents ask fewer unanswered questions because the knowledge base grows organically from real gaps they encountered.
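The three-way lookup and the feedback loop can be sketched together. The store and method names here are hypothetical; the essential behaviour is that an unanswered question becomes a logged assumption that surfaces in the PR, never a silent guess.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeAgent:
    decisions: dict = field(default_factory=dict)    # question -> approved answer
    assumptions: list = field(default_factory=list)  # surfaced in the PR

    def ask(self, question, default=None):
        if question in self.decisions:           # 1. grounded answer exists
            return {"kind": "decision", "answer": self.decisions[question]}
        if default is not None:                  # 2. log the assumption
            self.assumptions.append({"question": question, "assumed": default})
            return {"kind": "assumption", "answer": default}
        return {"kind": "needs-human", "answer": None}  # 3. ambiguous -> flag

    def approve(self, question, answer):
        # Reviewer sign-off turns an assumption into a documented decision,
        # so the same question never goes unanswered twice.
        self.decisions[question] = answer
        self.assumptions = [a for a in self.assumptions
                            if a["question"] != question]
```

This is the mechanism behind the compounding claim below: every approved PR shrinks the space of questions that future agents have to guess at.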
Figure 4: Knowledge Feedback Loop
Beyond Waterfall: Regenerative Software Engineering
The obvious question is: "Isn't this just waterfall?"
It's a fair challenge, and the honest answer is more interesting than "yes" or "no."
Waterfall failed for reasons that go beyond economics. Yes, writing specs took months and documents went stale — but waterfall also failed because it assumed you could *discover* requirements upfront. In reality, requirements emerge through building. Stakeholder incentives shift. Edge cases only surface during implementation. The feedback loop between "what we thought we needed" and "what we actually need" was measured in months, which is too slow for the loop to function at all.
Agile's response was correct in principle — shorter cycles, faster feedback, embrace change — but in practice it often meant abandoning structure altogether. Teams shipped faster but lost traceability, architectural consistency, and the ability to explain *why* the system works the way it does.
What agents enable is something different from both. We think of it as regenerative software engineering: the ability to regenerate the full specification-to-implementation chain cheaply enough that specs become disposable and reproducible rather than precious and stale.
The distinction matters:
Waterfall: Write specs once -> implement once -> specs decay -> nobody updates them.
Agile: Skip formal specs -> implement iteratively -> knowledge lives in people's heads -> traceability is an afterthought.
Regenerative: Generate specs -> implement -> requirements change -> *regenerate the entire chain* from updated inputs. The specs are never stale because they're never maintained — they're reproduced.
This changes the relationship between structure and agility. You don't choose between them. A product manager kicks off three competing feature experiments on Monday morning and reviews working implementations — complete with specs, architecture decisions, and traceability — by afternoon. If the business context shifts on Tuesday, you don't patch stale documents. You run a fresh cycle.
The phased structure gives you what waterfall promised — traceability, architectural consistency, auditable decision trails. The regenerative economics give you what Agile promised — rapid iteration, embrace of change, fast feedback loops. Neither methodology delivered both. This pattern can, when it works well.
| Dimension | Waterfall | Agile | Regenerative (SDD) |
|---|---|---|---|
| Specs | Written once, decay | Minimal/skipped | Generated, disposable, reproducible |
| Feedback loop | Months | Weeks/sprints | Hours |
| Traceability | High initially, degrades | Low | High, maintained by regeneration |
| Handles change | Poorly (costly rework) | Well (but loses structure) | Well (regenerate from new inputs) |
| Knowledge location | Documents (stale) | People's heads | Repository (living, versioned) |
Regenerative software engineering isn't a return to waterfall — it resolves the tradeoff between structure and agility that forced teams to choose between them.
What We've Learned Building This for Clients
After implementing this pattern across multiple engagements, here are the non-obvious lessons:
The organisational change is harder than the technical change. If your team's decisions live in Slack threads, hallway conversations, and "everyone just knows," agents can't participate. The discipline of writing things down — architecture decisions, conventions, the reasoning behind choices — is the real prerequisite. The tooling is secondary.
Start with the review gates, not the agents. Before you automate generation, define what "good" looks like for each artifact type. Write the evaluation criteria. Build the deterministic checks. The agents can be swapped, upgraded, or replaced; the quality gates are your actual product.
Throughput beats speed. The right question isn't "how do we make developers faster?" It's "how many feature experiments can we run before lunch?" The shift from accelerating individual tasks to parallelising entire feature cycles is where the real value compounds.
The knowledge base compounds — but needs maintenance. After 50+ features through this pipeline, the project's knowledge base becomes genuinely valuable. Agents hit fewer unanswered questions. Architectural consistency improves. New team members (human or AI) onboard faster because conventions are documented where they're used, not in a wiki nobody reads. But knowledge bases also drift, accumulate contradictions, and miss edge cases that only surface in production. Treat the knowledge base like code — it needs periodic review, pruning, and refactoring. The feedback loop helps, but it doesn't eliminate entropy.
Getting Started
You don't need to build the full pipeline on day one. Here's a practical progression:
1. Week 1: Create `.sdlc/context/` with your project overview, architecture decisions, and coding conventions. This alone improves agent output dramatically — even with manual prompting.
2. Weeks 2-3: Add templates for requirements and tasks. Start generating specs through agents and reviewing them in PRs alongside code.
3. Weeks 4-6: Implement review gates — deterministic checks first (they're faster to build and catch most issues), then add a critic agent for judgment-based validation.
4. Month 2+: Build the orchestration layer to connect phases automatically. Add the knowledge agent. At this point, humans only enter at PR review.
Each step delivers value independently. You don't need the full system to start seeing better agent outputs — structured context and templates get you 60% of the way there.
Where This Is Heading
The pattern described here — deterministic orchestration, bounded execution, automated evaluation, regenerative cycles — is converging across the industry. Tools like Spec Kit and Kiro are building products around these exact principles. Agent platforms are standardising on skill-based architectures (SKILL.md files) that encode domain expertise as reusable modules.
We think the next frontier is cross-repository orchestration — agents that understand how a change in Service A affects Service B's API contract, coordinated through the same deterministic workflow engine. The principles are identical; the scope expands.
But we want to be honest about where the edges are. This pattern works well for feature development with clear business requirements. It's less proven for exploratory R&D, complex refactoring of deeply coupled legacy systems, or domains where the "right answer" can't be evaluated programmatically. Knowing where the pattern applies — and where it doesn't — is as important as implementing it well.
For the domains where it does apply, the teams getting the most value have stopped asking "how do we make developers faster?" and started asking something more interesting: "What becomes economically viable to attempt when the cost of a structured, traceable feature cycle drops by an order of magnitude?"
That's the question worth answering.