dawalnut

Autonomous Development Loops: How I Ship Infrastructure with AI and Self-Improving Judges

ai · developer-experience · aws · architecture · typescript

Where Things Stand

The telephony platform has grown from the original 4 CDK stacks to 15 stacks, 30 Lambda handlers, and 833 passing tests. It's deployed to a real AWS account, all 24 verification checks pass, and the readiness score sits at 8.18/10.

And I'm pausing.

Not because the code is done — there are 12 accepted P3 gaps — but because the remaining blockers are outside the code. An AWS support case needs resolution, and a telephony provider needs to handle inbound connectivity. There's also the possibility that something is wrong in our code that our techniques didn't catch, which would be frustrating given how much infrastructure we've built around catching exactly that.

This post documents the development techniques I built during this project so I can pick up exactly where I left off. It's also an honest look at what worked, what failed, and what I learned about using AI to ship real infrastructure.

The Evolution: 4 Stacks to 15

The earlier blog posts describe a 4-stack architecture. That was the starting point. The platform evolved through several architectural pivots:

| Change | Why | Impact |
| --- | --- | --- |
| Wisdom-first conversation model | Eliminates the 8-second Connect Lambda timeout, enables unlimited conversation time | Removed all Lambda-based conversation loops, only disconnect handler remains |
| Routing flow stability layer | Phone numbers must stay associated with stable flow ARNs while implementation flows iterate | Added RoutingStack (14th stack), custom resource for phone number migration |
| BDA document processing | Document analysis requires Bedrock Data Automation integration | Added DataAutomationStack with S3 vault, BDA project, and 3 MCP tools |
| Contact Lens analytics | Real-time sentiment and transcription | Added Kinesis streaming, contact-lens-consumer Lambda |
| Demo scenario system | Two switchable demo environments for live presentations | Added DemoStack with scenario provisioner and seed data |
| Monitoring | 22 CloudWatch alarms across P0/P1/P2 severity | Added MonitoringStack |

Each change went through the OpenSpec workflow — explore, propose, apply, archive. The specs in openspec/specs/ remained the source of truth throughout.

The Autonomous Loop

The most powerful technique I developed is the autonomous development loop. It's a structured cycle that finds issues, fixes them, deploys, verifies, and learns — with minimal manual intervention.

Here's the flow:

┌─────────────────────────────────────────────────┐
│                 AUTONOMOUS LOOP                 │
│                                                 │
│  1. Pre-flight staleness sweep                  │
│     └─ Fix stale docs, metrics, references      │
│                                                 │
│  2. Readiness judges (5 parallel agents)        │
│     └─ Score across 5 dimensions                │
│                                                 │
│  3. Phase-by-phase fixes                        │
│     ├─ Documentation phase    → commit          │
│     ├─ Schema phase           → commit          │
│     ├─ Tests phase            → commit          │
│     ├─ Exploration phase      → commit          │
│     ├─ Gaps identification    → commit          │
│     └─ Knowledge phase        → commit          │
│                                                 │
│  4. Deploy                                      │
│     └─ CDK deploy to real AWS account           │
│                                                 │
│  5. Verify                                      │
│     └─ 24 automated checks (health + scenarios) │
│                                                 │
│  6. Learn                                       │
│     └─ Update knowledge base with findings      │
│                                                 │
│  └─ Repeat until diminishing returns            │
└─────────────────────────────────────────────────┘

Each iteration commits after every phase, so you see progress as it happens and can pause at any point. The commit history tells the story:

chore: post-deploy loop — knowledge update, run 31e
chore: deploy telephony-3 + verify 24/24 PASS
chore: knowledge phase — iteration 3 findings, diminishing returns
chore: exploration phase — Zod-validate BDA S3 output
chore: schema phase — Zod-validate copilot consumer and analysis

Run 31 alone produced 24 new tests, Zod-validated 15 Lambda handler boundaries, refreshed 5 documentation files, and fixed 2 architectural issues (an unbounded tool-use loop and dead code) — across 3 iterations before reaching diminishing returns.

The Readiness Judge System

The core of the loop is a multi-dimensional LLM-as-Judge system with five specialized judges that run in parallel:

| Judge | Role | What it catches |
| --- | --- | --- |
| J1 — Assumption Auditor | Classifies every claim as VERIFIED, UNVERIFIED, or FABRICATED | Stale test counts, outdated deployment status, metrics that don't match reality |
| J2 — Completeness Validator | Checks structural integrity and cross-references | Missing sections, broken links, inconsistent numbers across documents |
| J3 — Domain Expert | Challenges conclusions against real-world constraints | Security theater (CSP with unsafe-inline), unbounded loops, non-atomic operations |
| J4 — Frontend Specialist | Panel of accessibility, API patterns, and auth reviewers | Missing aria-labels, unvalidated API responses, session handling gaps |
| J5 — Ops Specialist | Panel of deploy, verify, and monitoring reviewers | Missing alarms, untested Lambda handlers, stale verification reports |

Each judge produces a score (1-10) and a confidence value (0-1). The composite score uses maturity-weighted averaging:

Composite = (J1 × 0.15) + (J2 × 0.10) + (J3 × 0.25) + (J4 × 0.20) + (J5 × 0.30)

The weights shift by maturity level. At staging maturity, J5 (ops) gets the highest weight because the question is "does it work end-to-end?" At prod maturity, J3 (domain) gets more weight because the question is "is it resilient?"
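A minimal sketch of the maturity-weighted composite. The staging weights are the ones from the formula above; the dev and prod weight tables are illustrative assumptions, not the project's exact values:

```typescript
type Judge = "J1" | "J2" | "J3" | "J4" | "J5";
type Maturity = "dev" | "staging" | "prod";

// Staging weights match the formula in the post; dev/prod are assumed
// examples of how weight can shift toward J5 (ops) or J3 (domain).
const WEIGHTS: Record<Maturity, Record<Judge, number>> = {
  dev:     { J1: 0.20, J2: 0.20, J3: 0.15, J4: 0.15, J5: 0.30 },
  staging: { J1: 0.15, J2: 0.10, J3: 0.25, J4: 0.20, J5: 0.30 },
  prod:    { J1: 0.15, J2: 0.10, J3: 0.35, J4: 0.15, J5: 0.25 },
};

function compositeScore(scores: Record<Judge, number>, maturity: Maturity): number {
  const w = WEIGHTS[maturity];
  const total = (Object.keys(w) as Judge[]).reduce(
    (sum, judge) => sum + scores[judge] * w[judge],
    0,
  );
  return Math.round(total * 100) / 100; // two-decimal scores like 8.18
}
```

Each weight table sums to 1.0, so a uniform set of judge scores passes through unchanged at any maturity level.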

The 9.21 Disaster

The most important lesson came from the judges' worst failure.

On Run 23, the readiness score was 9.21/10 with 0.92 confidence using production criteria. The judges said the code was near-perfect. Then I deployed it for the first time.

16 failures across 30 deployment attempts. Zero of those failures were caught by the judges.

The root cause was simple: the judges evaluated code structure — tests pass, patterns are correct, documentation is complete. They never deployed. CDK synth can pass while deploy fails, because synth doesn't validate API-level constraints (ARN formats, account quotas, schema evolution with newer AWS services).

This triggered three changes:

  1. Maturity levels — dev ("can it deploy?"), staging ("does it work E2E?"), prod ("is it resilient?"). Each level focuses judges on the right criteria.
  2. Deployment grounding — J5 confidence is capped based on actual deployment state. Never deployed? Cap at 0.70. Last deploy failed? Cap at 0.75. Deployed but no soak test? Cap at 0.85.
  3. Retroactive labeling — all runs before Run 24 were relabeled as prod maturity. Scores across maturity levels aren't comparable.
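The deployment-grounding rule reduces to a small cap function. The cap values are the ones listed above; the `DeployState` shape is an assumption for illustration:

```typescript
interface DeployState {
  everDeployed: boolean;
  lastDeploySucceeded: boolean;
  soakTested: boolean;
}

// Cap J5's self-reported confidence based on actual deployment state.
function capJ5Confidence(raw: number, state: DeployState): number {
  let cap = 1.0;
  if (!state.everDeployed) cap = 0.70;             // never deployed
  else if (!state.lastDeploySucceeded) cap = 0.75; // last deploy failed
  else if (!state.soakTested) cap = 0.85;          // deployed, no soak test
  return Math.min(raw, cap);
}
```

The key property: a judge can always score itself lower than the cap, but never higher, no matter how clean the code looks.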

After switching to staging maturity, the score dropped from 9.21 to 7.67. That drop was not regression — it was honesty.

A 9.0 score that can't deploy is worse than a 7.0 score that's grounded in reality.

Phase-by-Phase Execution

Early on, I tried running all fix phases in parallel — documentation, schemas, tests, exploration, gaps, knowledge — as a batch. The problem: you get a wall of results at the end with no ability to steer.

The fix was sequential phase-by-phase execution with commits between each phase:

Documentation phase — Fix staleness in reports, specs, skill references. This runs first because judges score against documentation. Stale docs drag J1 and J2 down regardless of code quality.

Schema phase — Fix type safety gaps. Remove .passthrough() from Zod schemas, add validation to unvalidated Lambda handlers, ensure frontend API methods use shared schemas.

Tests phase — Add missing tests for the schema changes. Each new Zod validation gets at least 3 tests (valid input, malformed input, missing required fields).

Exploration phase — Deep investigation of architectural issues surfaced by J3. This is where the unbounded while(true) tool-use loop was found and capped with MAX_TOOL_ROUNDS=10.
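The fix for the unbounded loop is a budget on tool rounds. A simplified synchronous sketch (`callModel` and `runTool` are hypothetical stand-ins for the real, async agent plumbing; the point is the bound, not the signatures):

```typescript
const MAX_TOOL_ROUNDS = 10;

interface ModelTurn {
  toolCall?: { name: string; args: unknown };
  text?: string;
}

function converse(
  callModel: (history: string[]) => ModelTurn,
  runTool: (name: string, args: unknown) => string,
): string {
  const history: string[] = [];
  // Previously `while (true)`: a model that keeps requesting tools would
  // loop forever. The cap bounds worst-case latency and cost.
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const turn = callModel(history);
    if (!turn.toolCall) return turn.text ?? "";
    history.push(runTool(turn.toolCall.name, turn.toolCall.args));
  }
  return "Tool budget exhausted; returning best-effort answer.";
}
```

The loop still exits early the moment the model answers without a tool call; the budget only bites in the pathological case.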

Gaps identification phase — Broader scanning beyond what judges flagged. Searches for patterns like unvalidated JSON.parse, missing error handling, dead code.
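A gaps scan for unvalidated `JSON.parse` can start as simply as this. A real scan would walk the TypeScript AST; this line-based regex (and the `*Schema.parse` naming convention it assumes) is a deliberate simplification:

```typescript
// Flag lines that call JSON.parse without feeding the result through a
// Zod schema's .parse() on the same line. Returns 1-based line numbers.
function findUnvalidatedJsonParse(source: string): number[] {
  const flagged: number[] = [];
  source.split("\n").forEach((line, i) => {
    if (line.includes("JSON.parse(") && !/\w+Schema\.parse\(/.test(line)) {
      flagged.push(i + 1);
    }
  });
  return flagged;
}
```

Crude scans like this over-flag, but for a gaps phase that's fine: every hit is a prompt to either validate the boundary or document why it's safe.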

Knowledge phase — Update skills and knowledge base with learnings from this iteration.

The benefit is visibility. After each phase commit, I can see what changed and decide whether to continue or adjust direction. Findings from earlier phases inform later ones — a schema fix in phase 2 might surface a test gap that phase 3 catches.

The Self-Improving Knowledge Base

The readiness judge system maintains three knowledge files that evolve with every run:

Criteria

Rubrics for each scoring dimension. These started generic and have been refined through 31 runs. Example evolution:

  • Seed: "Measures how well claims are grounded in verifiable evidence"
  • After Run 12: "When evaluating platforms with separate readiness reports, J1 should distinguish 'artifact accuracy' from 'codebase accuracy'"
  • After Run 15: "Above 8.0, J3 becomes the constraining dimension. The gap between 'feature implemented' and 'feature resilient' is where the remaining points live"

Blind Spots

Patterns that judges frequently miss. Each entry tracks how many times it was triggered and has a clean_runs counter. After 5 consecutive clean runs, a blind spot can be archived. Currently tracking entries like:

  • BS-008: Conflating designed vs implemented (4 hits) — "features in specs treated as implemented when code doesn't exist"
  • BS-041: Readiness report drift (4 hits, Critical Watch) — "code improves faster than reports are updated"
  • BS-003: Metric fabrication (2 hits) — "LLMs fill in plausible-sounding metrics that aren't computed from data"

Findings Log

A historical record of every run with scores, changes, and key insights. This is what enables trend analysis — you can see how the score evolved from 6.61 (Run 8, first 5-panel run) through the 9.21 peak (Run 23, ungrounded), the correction to 7.67 (Run 24, staging maturity), and the current 8.18 (Run 31, with deployment grounding).

The self-improvement cycle runs on every execution, not just failures:

After each run:
  1. Check if any blind spots were triggered → increment hit count
  2. Check for new patterns that should be blind spots → add them
  3. Update criteria with calibration notes from scoring disagreements
  4. Log the run with scores, changes, and key insights
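Steps 1 and 2 of that cycle amount to a small bookkeeping function. The hit count, `clean_runs` counter, and archive-after-5 rule follow the description above; the field names are assumptions:

```typescript
interface BlindSpot {
  id: string;        // e.g. "BS-008"
  hits: number;      // times this blind spot was triggered
  cleanRuns: number; // consecutive runs without a trigger
  archived: boolean; // retired after enough clean runs
}

const ARCHIVE_AFTER_CLEAN_RUNS = 5;

function updateBlindSpots(spots: BlindSpot[], triggeredIds: Set<string>): BlindSpot[] {
  return spots.map((spot) => {
    if (spot.archived) return spot;
    if (triggeredIds.has(spot.id)) {
      // Triggered again: bump hits, reset the clean streak.
      return { ...spot, hits: spot.hits + 1, cleanRuns: 0 };
    }
    const cleanRuns = spot.cleanRuns + 1;
    return { ...spot, cleanRuns, archived: cleanRuns >= ARCHIVE_AFTER_CLEAN_RUNS };
  });
}
```

Resetting `cleanRuns` on every trigger is the important detail: a blind spot only retires after five clean runs in a row, not five clean runs total.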

Skill Refinement: Preventing Knowledge Loss

When you work on a project across many sessions, hard-won patterns get lost between conversations. A fix you applied in session 12 might be re-discovered (or worse, un-done) in session 15.

The solution is mandatory skill refinement after every archive. The project maintains 16 skills covering all domains — from code-quality (Biome, hardening rules) to telephony-platform (Connect flows, Wisdom agent patterns) to readiness-judge (evaluation criteria, blind spots).

After completing any OpenSpec change, the archive step triggers a skill audit:

  1. "What did we learn from this change that should go into skills?"
  2. Research agents audit each affected skill
  3. Updates applied in parallel (one agent per skill)
  4. Skill updates committed

This is how patterns like "minimum 1024 MB for all Lambdas — no exceptions" survive across sessions. It's in the code-quality skill, checked on every Lambda creation.

Pre-Flight Staleness Sweep

Before judges run, a systematic scan fixes stale artifacts. This sounds minor but was the solution to a persistent problem: documentation drift caused score volatility.

In Run 12, the code was genuinely better — monitoring coverage went from 21/29 to 30/30 Lambdas, test count jumped from 512 to 873, Cognito auth was fixed. But the readiness report still said "21/29 monitoring" and "512 tests." J1 correctly penalized the stale report, and the score decreased despite real improvements.

The pre-flight sweep now checks:

  • Test counts vs actual pnpm test output
  • Stack counts vs actual CDK templates
  • Lambda counts vs code
  • Readiness report metrics vs verification commands
  • Skill knowledge files for outdated code references
  • Memory files for stale project status
  • Deployment report dates and outcomes

Everything is fixed before judges see it. This eliminated the "stale docs penalty" that caused J1/J2 to oscillate.
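One of those checks, sketched: compare the test count a report claims against the count the test runner actually reports. The `"N passing"` format is an assumption about the report text, not the project's exact layout:

```typescript
// Pull the first "<number> passing" claim out of a report, if any.
function extractDocumentedCount(reportText: string): number | null {
  const match = reportText.match(/(\d+)\s+passing/);
  return match ? Number(match[1]) : null;
}

// Stale only when the report makes a claim AND the claim is wrong.
function isStale(reportText: string, actualCount: number): boolean {
  const documented = extractDocumentedCount(reportText);
  return documented !== null && documented !== actualCount;
}
```

Run against the Run 12 numbers, a report still saying "512 passing" while the suite reports 873 would be flagged before any judge scores it.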

What Didn't Work

Not everything I tried was successful:

Parallel agent batches — Running all fix phases simultaneously. Too much context pollution, no ability to steer, and later phases couldn't benefit from earlier findings.

Prod maturity criteria too early — Scoring against production criteria (circuit breakers, retry jitter, ops runbooks) when the platform had never been deployed. This produced artificially high scores that meant nothing.

Judge panel without deployment grounding — The 9.21 disaster. Five judges scoring code quality gave a false sense of readiness. Adding the deployment cap was the single most important improvement.

Updating reports as a separate step — Documentation drift is the persistent drag on readiness scores. Reports should be updated as part of implementation commits, not only during readiness runs. I learned this the hard way over multiple runs where J1 scored 2+ points below J5 because the code was better than the docs claimed.

The Current State

Here's where the platform sits today:

| Metric | Value |
| --- | --- |
| CDK stacks | 15 (all deployed, 9 updated on last deploy) |
| Lambda handlers | 30, all 1024 MB minimum, X-Ray traced |
| Zod-validated boundaries | All 30 handlers |
| Tests | 833 passing (691 infra + 142 frontend) |
| Verify checks | 24/24 PASS (16 health + 7 scenario + 1 round-trip) |
| CloudWatch alarms | 22 across P0/P1/P2 |
| Readiness score | 8.18/10 (confidence 0.85, staging maturity) |
| OpenSpec changes | 26 archived, 0 active |
| Blind spots tracked | 66 total, ~10 active |
| Judge runs | 31 (across 2 weeks) |

Architecture

The platform runs on a Wisdom-first conversation model — a single Amazon Q in Connect (Wisdom) agent handles all AI conversations, with no Lambda in the call path. This gives unlimited conversation time (the previous architecture was constrained by Connect's 8-second Lambda timeout).

Six MCP tools are registered with the agent: knowledge base search, outcome reporting, document upload/status/query, and escalation. The routing layer uses TransferToFlow patterns so phone numbers stay stable while implementation flows iterate.

What's Blocking

Three things are outside the code:

  1. AWS support case — a regional API issue (KB association returns 500 in eu-central-1) that requires manual console configuration as a workaround
  2. Telephony provider connectivity — inbound call routing needs provider-side setup. The original UK DID number never received inbound calls despite correct Connect configuration — we switched to a UK Toll-Free number (+44 808 812 7178) which uses multi-carrier SOMOS routing and should be more reliable
  3. Potential code gap — there's a possibility that the conversation flow has a bug that our verification (unit tests, deploy checks, scenario validation) didn't catch. The platform has never received a real phone call.

The third one is the most concerning. The readiness system is designed to catch structural issues, but it can't simulate a real caller speaking to the AI through Connect's media pipeline. That's the gap between staging maturity (8.18) and production maturity — you need actual traffic.

The Meta-Lesson

Building the readiness judge system took significant effort — 31 runs, 3 knowledge files, 66 blind spots, maturity levels, deployment grounding, staleness sweeps. Was it worth it?

Yes, but not for the reason I expected.

The value isn't in the final score. A number like 8.18 is useful for tracking trends, but the real value is in what the system forces you to do:

  • Verify your assumptions — J1 catches claims that sound right but aren't backed by evidence
  • Check your coverage — J2 finds the section you forgot to update after a refactor
  • Challenge your design — J3 asks "would this actually work under adversarial conditions?"
  • Test your interfaces — J4 catches the aria-label you missed and the API response you didn't validate
  • Ground in reality — J5 asks "have you actually deployed this?" and caps your confidence if you haven't

Each of these is a question I should have been asking myself all along. The system automates the discipline.

The autonomous loop is the same idea at a larger scale: find issues, fix them, verify the fix, learn from it, repeat. The phases prevent overwhelm. The commits prevent lost work. The knowledge base prevents re-learning.

When I come back to this project — after the support case resolves, after the telephony provider sorts out connectivity — I'll have 16 skills, 38 specs, 31 runs of findings, and a readiness system that knows what to check. That's the real output of this work: not just a deployed platform, but a system for maintaining it.

The best time to build your verification infrastructure is before you need it. The second best time is after your 9.21-scoring code fails 16 times on first deploy.
