Autonomous Development Loops: How I Ship Infrastructure with AI and Self-Improving Judges
Where Things Stand
The telephony platform has grown from the original 4 CDK stacks to 15 stacks, 30 Lambda handlers, and 833 passing tests. It's deployed to a real AWS account, all 24 verification checks pass, and the readiness score sits at 8.18/10.
And I'm pausing.
Not because the code is done — there are 12 accepted P3 gaps — but because the remaining blockers are outside the code. An AWS support case needs resolution, and a telephony provider needs to handle inbound connectivity. There's also the possibility that something is wrong in our code that our techniques didn't catch, which would be frustrating given how much infrastructure we've built around catching exactly that.
This post documents the development techniques I built during this project so I can pick up exactly where I left off. It's also an honest look at what worked, what failed, and what I learned about using AI to ship real infrastructure.
The Evolution: 4 Stacks to 15
The earlier blog posts describe a 4-stack architecture. That was the starting point. The platform evolved through several architectural pivots:
| Change | Why | Impact |
|---|---|---|
| Wisdom-first conversation model | Eliminates the 8-second Connect Lambda timeout, enables unlimited conversation time | Removed all Lambda-based conversation loops, only disconnect handler remains |
| Routing flow stability layer | Phone numbers must stay associated with stable flow ARNs while implementation flows iterate | Added RoutingStack (14th stack), custom resource for phone number migration |
| BDA document processing | Document analysis requires Bedrock Data Automation integration | Added DataAutomationStack with S3 vault, BDA project, and 3 MCP tools |
| Contact Lens analytics | Real-time sentiment and transcription | Added Kinesis streaming, contact-lens-consumer Lambda |
| Demo scenario system | Two switchable demo environments for live presentations | Added DemoStack with scenario provisioner and seed data |
| Monitoring | 22 CloudWatch alarms across P0/P1/P2 severity | Added MonitoringStack |
Each change went through the OpenSpec workflow —
explore, propose, apply, archive. The specs in openspec/specs/ remained the source of
truth throughout.
The Autonomous Loop
The most powerful technique I developed is the autonomous development loop. It's a structured cycle that finds issues, fixes them, deploys, verifies, and learns — with minimal manual intervention.
Here's the flow:
```
┌─────────────────────────────────────────────────┐
│                 AUTONOMOUS LOOP                 │
│                                                 │
│  1. Pre-flight staleness sweep                  │
│     └─ Fix stale docs, metrics, references      │
│                                                 │
│  2. Readiness judges (5 parallel agents)        │
│     └─ Score across 5 dimensions                │
│                                                 │
│  3. Phase-by-phase fixes                        │
│     ├─ Documentation phase → commit             │
│     ├─ Schema phase → commit                    │
│     ├─ Tests phase → commit                     │
│     ├─ Exploration phase → commit               │
│     ├─ Gaps identification → commit             │
│     └─ Knowledge phase → commit                 │
│                                                 │
│  4. Deploy                                      │
│     └─ CDK deploy to real AWS account           │
│                                                 │
│  5. Verify                                      │
│     └─ 24 automated checks (health + scenarios) │
│                                                 │
│  6. Learn                                       │
│     └─ Update knowledge base with findings      │
│                                                 │
│  └─ Repeat until diminishing returns            │
└─────────────────────────────────────────────────┘
```
Each iteration commits after every phase, so you see progress as it happens and can pause at any point. The commit history tells the story:
```
chore: post-deploy loop — knowledge update, run 31e
chore: deploy telephony-3 + verify 24/24 PASS
chore: knowledge phase — iteration 3 findings, diminishing returns
chore: exploration phase — Zod-validate BDA S3 output
chore: schema phase — Zod-validate copilot consumer and analysis
```
Run 31 alone produced 24 new tests, Zod-validated 15 Lambda handler boundaries, refreshed 5 documentation files, and fixed 2 architectural issues (an unbounded tool-use loop and dead code) — across 3 iterations before reaching diminishing returns.
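The commit-per-phase mechanic reduces to a small driver. What follows is a hypothetical sketch, not the actual loop code; the phase shape and the commit callback are names I made up for illustration:

```typescript
type Phase = { name: string; run: () => void };

// Run each fix phase sequentially, committing after every phase so
// progress is visible and the loop can be paused between phases.
// In the real loop, the commit callback would shell out to git.
function runIteration(phases: Phase[], commit: (message: string) => void): void {
  for (const phase of phases) {
    phase.run(); // apply this phase's fixes to the working tree
    commit(`chore: ${phase.name} phase`);
  }
}
```

Sequencing matters here: each `run` sees the working tree as the previous phase left it, which is how later phases benefit from earlier findings.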
The Readiness Judge System
The core of the loop is a multi-dimensional LLM-as-Judge system with five specialized judges that run in parallel:
| Judge | Role | What it catches |
|---|---|---|
| J1 — Assumption Auditor | Classifies every claim as VERIFIED, UNVERIFIED, or FABRICATED | Stale test counts, outdated deployment status, metrics that don't match reality |
| J2 — Completeness Validator | Checks structural integrity and cross-references | Missing sections, broken links, inconsistent numbers across documents |
| J3 — Domain Expert | Challenges conclusions against real-world constraints | Security theater (CSP with unsafe-inline), unbounded loops, non-atomic operations |
| J4 — Frontend Specialist | Panel of accessibility, API patterns, and auth reviewers | Missing aria-labels, unvalidated API responses, session handling gaps |
| J5 — Ops Specialist | Panel of deploy, verify, and monitoring reviewers | Missing alarms, untested Lambda handlers, stale verification reports |
Each judge produces a score (1-10) and a confidence value (0-1). The composite score uses maturity-weighted averaging:
Composite = (J1 × 0.15) + (J2 × 0.10) + (J3 × 0.25) + (J4 × 0.20) + (J5 × 0.30)
The weights shift by maturity level. At staging maturity, J5 (ops) gets the highest weight because the question is "does it work end-to-end?" At prod maturity, J3 (domain) gets more weight because the question is "is it resilient?"
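As a sketch, the maturity-weighted composite might look like the following. The staging weights are the ones in the formula above; the dev and prod weight vectors are illustrative assumptions showing how the emphasis shifts:

```typescript
type Maturity = "dev" | "staging" | "prod";

interface JudgeScore {
  score: number;      // 1-10
  confidence: number; // 0-1
}

// Weight order: [J1, J2, J3, J4, J5]. Staging weights are from the
// formula above; dev and prod are illustrative assumptions.
const WEIGHTS: Record<Maturity, number[]> = {
  dev:     [0.15, 0.15, 0.15, 0.15, 0.40], // "can it deploy?" — ops-heavy
  staging: [0.15, 0.10, 0.25, 0.20, 0.30], // "does it work E2E?"
  prod:    [0.15, 0.10, 0.35, 0.15, 0.25], // "is it resilient?" — domain-heavy
};

function compositeScore(judges: JudgeScore[], maturity: Maturity): number {
  const weights = WEIGHTS[maturity];
  const total = judges.reduce((sum, judge, i) => sum + judge.score * weights[i], 0);
  return Math.round(total * 100) / 100; // two decimals, like 8.18
}
```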
The 9.21 Disaster
The most important lesson came from the judges' worst failure.
On Run 23, the readiness score was 9.21/10 with 0.92 confidence using production criteria. The judges said the code was near-perfect. Then I deployed it for the first time.
16 failures across 30 deployment attempts. Zero of those failures were caught by the judges.
The root cause was simple: the judges evaluated code structure — tests pass, patterns are correct, documentation is complete. They never deployed. CDK synth passes but deploy fails because synth doesn't validate API-level constraints (ARN formats, account quotas, schema evolution with newer AWS services).
This triggered three changes:
- Maturity levels — dev ("can it deploy?"), staging ("does it work E2E?"), prod ("is it resilient?"). Each level focuses judges on the right criteria.
- Deployment grounding — J5 confidence is capped based on actual deployment state. Never deployed? Cap at 0.70. Last deploy failed? Cap at 0.75. Deployed but no soak test? Cap at 0.85.
- Retroactive labeling — all runs before Run 24 were relabeled as prod maturity. Scores across maturity levels aren't comparable.
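The deployment-grounding cap is a simple ceiling. A minimal sketch, assuming a deployment-state enum (the state names are mine; the cap values come from the list above, and the uncapped "soaked" state is an assumption):

```typescript
type DeployState =
  | "never-deployed"
  | "last-deploy-failed"
  | "deployed-no-soak"
  | "soaked";

// Confidence ceilings tied to deployment reality. "soaked" at 1.00
// (no cap) is my assumption; the other three values are from the post.
const CONFIDENCE_CAPS: Record<DeployState, number> = {
  "never-deployed":     0.70,
  "last-deploy-failed": 0.75,
  "deployed-no-soak":   0.85,
  "soaked":             1.00,
};

// J5's self-reported confidence can never exceed what the deployment
// state supports, no matter how good the code looks on paper.
function groundedConfidence(selfReported: number, state: DeployState): number {
  return Math.min(selfReported, CONFIDENCE_CAPS[state]);
}
```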
After switching to staging maturity, the score dropped from 9.21 to 7.67. That drop was not regression — it was honesty.
A 9.0 score that can't deploy is worse than a 7.0 score that's grounded in reality.
Phase-by-Phase Execution
Early on, I tried running all fix phases in parallel — documentation, schemas, tests, exploration, gaps, knowledge — as a batch. The problem: you get a wall of results at the end with no ability to steer.
The fix was sequential phase-by-phase execution with commits between each phase:
Documentation phase — Fix staleness in reports, specs, skill references. This runs first because judges score against documentation. Stale docs drag J1 and J2 down regardless of code quality.
Schema phase — Fix type safety gaps. Remove .passthrough() from Zod schemas, add
validation to unvalidated Lambda handlers, ensure frontend API methods use shared schemas.
Tests phase — Add missing tests for the schema changes. Each new Zod validation gets at least 3 tests (valid input, malformed input, missing required fields).
Exploration phase — Deep investigation of architectural issues surfaced by J3. This is
where the unbounded while(true) tool-use loop was found and capped with MAX_TOOL_ROUNDS=10.
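A sketch of the capped loop, with hypothetical model and tool interfaces standing in for the real agent code (only MAX_TOOL_ROUNDS = 10 comes from the actual fix):

```typescript
const MAX_TOOL_ROUNDS = 10;

interface ModelTurn {
  toolCall?: { name: string; input: unknown };
  text?: string;
}

// Previously: while (true). A malformed model response that always
// requests another tool call would loop (and bill) forever; the round
// cap bounds the worst case.
async function runConversation(
  callModel: (history: ModelTurn[]) => Promise<ModelTurn>,
  runTool: (name: string, input: unknown) => Promise<string>,
): Promise<string> {
  const history: ModelTurn[] = [];
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const turn = await callModel(history);
    history.push(turn);
    if (!turn.toolCall) return turn.text ?? ""; // model is done
    const result = await runTool(turn.toolCall.name, turn.toolCall.input);
    history.push({ text: result }); // feed tool output back to the model
  }
  throw new Error(`Exceeded ${MAX_TOOL_ROUNDS} tool rounds`);
}
```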
Gaps identification phase — Broader scanning beyond what judges flagged. Searches for
patterns like unvalidated JSON.parse, missing error handling, dead code.
Knowledge phase — Update skills and knowledge base with learnings from this iteration.
The benefit is visibility. After each phase commit, I can see what changed and decide whether to continue or adjust direction. Findings from earlier phases inform later ones — a schema fix in phase 2 might surface a test gap that phase 3 catches.
The Self-Improving Knowledge Base
The readiness judge system maintains three knowledge files that evolve with every run:
Criteria
Rubrics for each scoring dimension. These started generic and have been refined through 31 runs. Example evolution:
- Seed: "Measures how well claims are grounded in verifiable evidence"
- After Run 12: "When evaluating platforms with separate readiness reports, J1 should distinguish 'artifact accuracy' from 'codebase accuracy'"
- After Run 15: "Above 8.0, J3 becomes the constraining dimension. The gap between 'feature implemented' and 'feature resilient' is where the remaining points live"
Blind Spots
Patterns that judges frequently miss. Each entry tracks how many times it was triggered
and has a clean_runs counter. After 5 consecutive clean runs, a blind spot can be
archived. Currently tracking entries like:
- BS-008: Conflating designed vs implemented (4 hits) — "features in specs treated as implemented when code doesn't exist"
- BS-041: Readiness report drift (4 hits, Critical Watch) — "code improves faster than reports are updated"
- BS-003: Metric fabrication (2 hits) — "LLMs fill in plausible-sounding metrics that aren't computed from data"
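The hit/clean-run lifecycle can be sketched like this; field names beyond clean_runs are assumptions:

```typescript
interface BlindSpot {
  id: string;
  description: string;
  hits: number;
  clean_runs: number;
  archived: boolean;
}

const ARCHIVE_AFTER_CLEAN_RUNS = 5;

// After each judge run: triggered blind spots increment their hit count
// and reset their clean streak; untriggered ones accumulate clean runs
// until they qualify for archiving.
function updateBlindSpots(
  spots: BlindSpot[],
  triggeredIds: Set<string>,
): BlindSpot[] {
  return spots.map((spot) => {
    if (spot.archived) return spot;
    if (triggeredIds.has(spot.id)) {
      return { ...spot, hits: spot.hits + 1, clean_runs: 0 };
    }
    const clean_runs = spot.clean_runs + 1;
    return { ...spot, clean_runs, archived: clean_runs >= ARCHIVE_AFTER_CLEAN_RUNS };
  });
}
```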
Findings Log
A historical record of every run with scores, changes, and key insights. This is what enables trend analysis — you can see how the score evolved from 6.61 (Run 8, first 5-panel run) through the 9.21 peak (Run 23, ungrounded), the correction to 7.67 (Run 24, staging maturity), and the current 8.18 (Run 31, with deployment grounding).
The self-improvement cycle runs on every execution, not just failures:
After each run:
1. Check if any blind spots were triggered → increment hit count
2. Check for new patterns that should be blind spots → add them
3. Update criteria with calibration notes from scoring disagreements
4. Log the run with scores, changes, and key insights
Skill Refinement: Preventing Knowledge Loss
When you work on a project across many sessions, hard-won patterns get lost between conversations. A fix you applied in session 12 might be re-discovered (or worse, un-done) in session 15.
The solution is mandatory skill refinement after every archive. The project maintains
16 skills covering all domains — from code-quality (Biome, hardening rules) to
telephony-platform (Connect flows, Wisdom agent patterns) to readiness-judge
(evaluation criteria, blind spots).
After completing any OpenSpec change, the archive step triggers a skill audit:
- "What did we learn from this change that should go into skills?"
- Research agents audit each affected skill
- Updates applied in parallel (one agent per skill)
- Skill updates committed
This is how patterns like "minimum 1024 MB for all Lambdas — no exceptions" survive
across sessions. It's in the code-quality skill, checked on every Lambda creation.
Pre-Flight Staleness Sweep
Before judges run, a systematic scan fixes stale artifacts. This sounds minor but was the solution to a persistent problem: documentation drift caused score volatility.
In Run 12, the code was genuinely better — monitoring coverage went from 21/29 to 30/30 Lambdas, test count jumped from 512 to 873, Cognito auth was fixed. But the readiness report still said "21/29 monitoring" and "512 tests." J1 correctly penalized the stale report, and the score decreased despite real improvements.
The pre-flight sweep now checks:
- Test counts vs actual pnpm test output
- Stack counts vs actual CDK templates
- Lambda counts vs code
- Readiness report metrics vs verification commands
- Skill knowledge files for outdated code references
- Memory files for stale project status
- Deployment report dates and outcomes
Everything is fixed before judges see it. This eliminated the "stale docs penalty" that caused J1/J2 to oscillate.
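One such check, sketched; the report line format and regex are assumptions, not the actual sweep code:

```typescript
// Compare the test count a readiness report claims against the count
// the test runner actually produced. Returns null when they agree,
// or a description of the staleness when they don't.
function findStaleTestCount(reportText: string, actualCount: number): string | null {
  // Hypothetical report convention: "Tests: 833 passing"
  const match = reportText.match(/Tests:\s*(\d+)\s+passing/);
  if (!match) return "report has no test-count line";
  const documented = Number(match[1]);
  return documented === actualCount
    ? null
    : `stale: report says ${documented}, actual is ${actualCount}`;
}
```

The same compare-claimed-vs-actual shape applies to stack counts, Lambda counts, and alarm counts; only the source of the "actual" number changes.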
What Didn't Work
Not everything I tried was successful:
Parallel agent batches — Running all fix phases simultaneously. Too much context pollution, no ability to steer, and later phases couldn't benefit from earlier findings.
Prod maturity criteria too early — Scoring against production criteria (circuit breakers, retry jitter, ops runbooks) when the platform had never been deployed. This produced artificially high scores that meant nothing.
Judge panel without deployment grounding — The 9.21 disaster. Five judges scoring code quality gave a false sense of readiness. Adding the deployment cap was the single most important improvement.
Updating reports as a separate step — Documentation drift is the persistent drag on readiness scores. Reports should be updated as part of implementation commits, not only during readiness runs. I learned this the hard way over multiple runs where J1 scored 2+ points below J5 because the code was better than the docs claimed.
The Current State
Here's where the platform sits today:
| Metric | Value |
|---|---|
| CDK stacks | 15 (all deployed, 9 updated on last deploy) |
| Lambda handlers | 30, all 1024 MB minimum, X-Ray traced |
| Zod-validated boundaries | All 30 handlers |
| Tests | 833 passing (691 infra + 142 frontend) |
| Verify checks | 24/24 PASS (16 health + 7 scenario + 1 round-trip) |
| CloudWatch alarms | 22 across P0/P1/P2 |
| Readiness score | 8.18/10 (confidence 0.85, staging maturity) |
| OpenSpec changes | 26 archived, 0 active |
| Blind spots tracked | 66 total, ~10 active |
| Judge runs | 31 (across 2 weeks) |
Architecture
The platform runs on a Wisdom-first conversation model — a single Amazon Q in Connect (Wisdom) agent handles all AI conversations, with no Lambda in the call path. This gives unlimited conversation time (the previous architecture was constrained by Connect's 8-second Lambda timeout).
Six MCP tools are registered with the agent: knowledge base search, outcome reporting, document upload/status/query, and escalation. The routing layer uses TransferToFlow patterns so phone numbers stay stable while implementation flows iterate.
What's Blocking
Three things are outside the code:
- AWS support case — a regional API issue (KB association returns 500 in eu-central-1) that requires manual console configuration as a workaround
- Telephony provider connectivity — inbound call routing needs provider-side setup. The original UK DID number never received inbound calls despite correct Connect configuration — we switched to a UK Toll-Free number (+44 808 812 7178) which uses multi-carrier SOMOS routing and should be more reliable
- Potential code gap — there's a possibility that the conversation flow has a bug that our verification (unit tests, deploy checks, scenario validation) didn't catch. The platform has never received a real phone call.
The third one is the most concerning. The readiness system is designed to catch structural issues, but it can't simulate a real caller speaking to the AI through Connect's media pipeline. That's the gap between staging maturity (8.18) and production maturity — you need actual traffic.
The Meta-Lesson
Building the readiness judge system took significant effort — 31 runs, 3 knowledge files, 66 blind spots, maturity levels, deployment grounding, staleness sweeps. Was it worth it?
Yes, but not for the reason I expected.
The value isn't in the final score. A number like 8.18 is useful for tracking trends, but the real value is in what the system forces you to do:
- Verify your assumptions — J1 catches claims that sound right but aren't backed by evidence
- Check your coverage — J2 finds the section you forgot to update after a refactor
- Challenge your design — J3 asks "would this actually work under adversarial conditions?"
- Test your interfaces — J4 catches the aria-label you missed and the API response you didn't validate
- Ground in reality — J5 asks "have you actually deployed this?" and caps your confidence if you haven't
Each of these is a question I should have been asking myself all along. The system automates the discipline.
The autonomous loop is the same idea at a larger scale: find issues, fix them, verify the fix, learn from it, repeat. The phases prevent overwhelm. The commits prevent lost work. The knowledge base prevents re-learning.
When I come back to this project — after the support case resolves, after the telephony provider sorts out connectivity — I'll have 16 skills, 38 specs, 31 runs of findings, and a readiness system that knows what to check. That's the real output of this work: not just a deployed platform, but a system for maintaining it.
The best time to build your verification infrastructure is before you need it. The second best time is after your 9.21-scoring code fails 16 times on first deploy.