Spec-Driven Development with AI: OpenSpec and the Context Engineering Kit
The Problem with "Just Build It"
The telephony platform I recently built has 4 CDK stacks, 4 Lambda handlers, a TanStack Start dashboard with 9 routes, a shared domain package with 22 Zod schemas, and a single-table DynamoDB design. That's a lot of surface area.
When I started, I had a rough idea: "AI voice agents on Amazon Connect." That's not enough to start coding. Without structure, you end up in a loop — build something, realize the data model doesn't support the access pattern you need, rip it out, start over.
I needed two things: a way to think through the design before writing code, and a way to break the work into pieces small enough to execute cleanly. That's where OpenSpec and the Context Engineering Kit come in.
OpenSpec: From Idea to Implementation Plan
OpenSpec is a spec-driven workflow I use for any non-trivial change. It has seven phases, and each one produces a concrete artifact.
Phase 1: Proposal
The proposal answers "why are we doing this?" and "what changes?" It's deliberately short — a page at most. For the telephony project:
- Why: Build an AI-powered voice agent call center as a showcase project
- What changes: New projects/telephony workspace with shared models, CDK infra, and dashboard
- Affected workspaces: New telephony packages, plus management account for OUs
The point isn't to be exhaustive. It's to force yourself to articulate the value before diving into architecture. If you can't write a clear proposal, you're not ready to build.
Phase 2: Design
The design document captures architectural decisions with rationales and alternatives considered. This is where trade-offs live:
| Decision | Chosen | Why | Alternative |
|---|---|---|---|
| Telephony | Amazon Connect | Managed PSTN, built-in Polly TTS, EventBridge integration | Twilio (more flexible, but more to manage) |
| AI | Amazon Bedrock | Claude models, native tool use, no self-hosted infra | SageMaker (more control, much more ops work) |
| Data | DynamoDB single-table | All access patterns served by one table + one GSI | RDS (overkill for key-value patterns) |
| Auth | Cognito | Managed OAuth2, integrates with API Gateway authorizers | Auth0 (better DX, another vendor) |
Each decision is one line, but it answers months of future "wait, why did we do it this way?" questions.
The design also includes a directory structure, cost model, and open questions. Open questions are important — they're the things you don't know yet and shouldn't pretend you do.
Phase 3: Specs
Specs define observable behavior using Given/When/Then format with RFC 2119 keywords (MUST, SHOULD, MAY). They describe what the system does, not how it's implemented:
    Given a caller dials a claimed phone number
    When Amazon Connect receives the inbound call
    Then the contact flow MUST invoke the AI Conversation Lambda
    And the Lambda MUST send the caller's utterance to Bedrock
    And the Lambda MUST return the response text within 8 seconds
Specs are the source of truth. If the code doesn't match the spec, the code is wrong. If the spec is wrong, update the spec first, then the code.
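A scenario in this format maps naturally onto an executable check. Here is a minimal sketch in TypeScript, with a stub standing in for the real Connect-to-Lambda-to-Bedrock round trip; every name here is hypothetical, not the project's actual test code:

```typescript
// Hypothetical sketch: turning the "MUST return within 8 seconds" clause
// into a timed assertion. respondTo() is a stub, not the real handler.
async function respondTo(utterance: string): Promise<{ text: string; ms: number }> {
  const start = Date.now();
  const text = `echo: ${utterance}`; // stand-in for the Bedrock round trip
  return { text, ms: Date.now() - start };
}

// "Then the Lambda MUST return the response text within 8 seconds"
async function latencySpecHolds(): Promise<boolean> {
  const { text, ms } = await respondTo("I'd like to check on my claim");
  return text.length > 0 && ms < 8_000;
}
```

The real project runs checks like this under vitest; the point is that each MUST clause gives you one unambiguous assertion.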
For the telephony project, I wrote specs for six capabilities: Connect infrastructure, voice agent, call analytics, dashboard, integrations (webhooks), and CDK infra. Each capability gets its own spec file.
Phase 4: Tasks
Tasks break specs into implementable units, grouped by phase and sized for estimation:
- Phase A: Setup — scaffolding, workspace registration (XS tasks)
- Phase B: Foundation — CDK stacks, DynamoDB table, API Gateway (M tasks)
- Phase C: Features — Lambda handlers, Bedrock integration, tool use (L tasks)
- Phase D: Dashboard — routes, components, forms (M-L tasks)
- Phase E: Integration — webhooks, management account OUs (S-M tasks)
- Phase F: Quality — tests, linting, verification (M tasks)
The telephony project had 26 top-level tasks with 98 subtasks. Each task has explicit dependencies ("3.4 Create Contact Flow" requires "2.2 AiStack" because the flow needs the Lambda ARN) and acceptance criteria linked back to spec scenarios.
The goal is tasks small enough to complete without losing context, ordered so you never hit a blocker you haven't planned for.
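The dependency ordering described above is just a topological sort over the task graph. A small sketch, where the task ids mirror the example in the text but the `Task` shape itself is my own, not OpenSpec's:

```typescript
// Illustrative task shape: id, size for estimation, and explicit dependencies.
interface Task {
  id: string;
  name: string;
  size: "XS" | "S" | "M" | "L";
  dependsOn: string[];
}

// Depth-first topological sort: every task appears after its dependencies,
// and a cycle (a blocker you haven't planned for) fails loudly.
function orderTasks(tasks: Task[]): Task[] {
  const byId = new Map(tasks.map((t) => [t.id, t]));
  const done = new Set<string>();
  const ordered: Task[] = [];
  const visit = (t: Task, path: Set<string>): void => {
    if (done.has(t.id)) return;
    if (path.has(t.id)) throw new Error(`dependency cycle at ${t.id}`);
    path.add(t.id);
    for (const dep of t.dependsOn) visit(byId.get(dep)!, path);
    path.delete(t.id);
    done.add(t.id);
    ordered.push(t);
  };
  for (const t of tasks) visit(t, new Set());
  return ordered;
}

const ordered = orderTasks([
  { id: "3.4", name: "Create Contact Flow", size: "M", dependsOn: ["2.2"] },
  { id: "2.2", name: "AiStack", size: "L", dependsOn: [] },
]);
// "2.2" sorts before "3.4" because the flow needs the Lambda ARN
```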
Phases 5-7: Implement, Verify, Archive
Implementation follows the task order. Verification runs pnpm check — the full pipeline
of knip, biome, typecheck, tests, and CDK synth. Archive marks the change as complete.
The Context Engineering Kit: Research Before You Build
OpenSpec tells you what to build and in what order. CEK (the Context Engineering Kit) helps you figure out how — it's the research layer.
Scratchpads
Before writing any code, I create scratchpads — unstructured research documents that capture raw thinking. For the telephony project, I ended up with seven; the six that shaped the design are worth walking through:
Business analysis scratchpad — Problem definition, actors (caller, agent, admin), data entities (conversations, outcomes, phone mappings), constraints, and ambiguities. This is where I caught things like "should Connect instances be per-tenant or shared?" before they became architectural dead ends.
Technical research scratchpad — Raw AWS API exploration. Connect's Lambda integration
has an 8-second timeout. Bedrock's Converse API returns different stopReason values
depending on whether it wants to use a tool or end the turn. DynamoDB's single-table
design requires knowing all access patterns upfront. This scratchpad captures the facts
that drive design decisions.
Code explorer scratchpad — Walkthrough of the existing codebase to identify reusable
patterns. The portfolio project already had a SsrSite CDK construct, Zod config
validation, cross-region certificate handling, and delegated DNS. The telephony project
could reuse all of these instead of reinventing them.
Architecture scratchpad — Solution strategy thinking. Why a thin wrapper over managed services? Why a single DynamoDB table? What's the conversation lifecycle? How do stacks depend on each other? This is where the "Connect needs the Lambda ARN, but the Lambda needs the Connect instance ARN" circular dependency was identified and resolved (deploy AiStack first, use the instance alias as a stand-in).
Decomposition scratchpad — Phase breakdown, task sizing, dependency graph, and risk register. Each risk gets a severity (HIGH/MEDIUM/LOW) and a mitigation. "Bedrock timeout exceeds Connect's 8-second limit" → "Cache agent config in module scope, keep conversation payloads small."
Parallelization scratchpad — Maps 98 subtasks into 9 waves of parallel work, identifying the critical path (9 sequential steps) and assigning task complexity to appropriate tools.
Analysis Files
After the scratchpad research, I write an analysis file — a structured document that translates research into a concrete implementation plan. It lists every file to create, with purpose and key decisions:
    ## @dawalnut/telephony-shared

    ### src/constants.ts
    - DynamoDB key prefixes (TENANT#, AGENT#, CALL#, PHONE#)
    - EventBridge event types
    - TTL configuration

    ### src/schemas.ts
    - 22 Zod schemas covering all domain entities
    - Discriminated unions for message roles and webhook auth types
    - E.164 phone number validation
The analysis file is the bridge between thinking and doing. Once it exists, implementation is largely mechanical — you know what to build, where to put it, and what patterns to follow.
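For flavor, here is what the constants and key builders might look like. The prefixes come from the analysis above; the builder functions and the dependency-free E.164 check are illustrative sketches, not the real @dawalnut/telephony-shared code (which does this validation in Zod):

```typescript
// Single-table key prefixes, as listed in the analysis file.
const PREFIX = { tenant: "TENANT#", agent: "AGENT#", call: "CALL#", phone: "PHONE#" } as const;

// Hypothetical key builders: centralizing these keeps every access
// pattern consistent with the single-table design.
const tenantKey = (id: string) => `${PREFIX.tenant}${id}`;
const callKey = (tenantId: string, callId: string) => ({
  pk: tenantKey(tenantId),        // partition: all items for one tenant
  sk: `${PREFIX.call}${callId}`,  // sort: calls queryable by prefix
});

// E.164: "+" then 2-15 digits, first digit nonzero — the same rule a
// Zod .regex() refinement would enforce.
const E164 = /^\+[1-9]\d{1,14}$/;
const isE164 = (s: string) => E164.test(s);
```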
How They Work Together
    Fuzzy idea
      ↓
    OpenSpec Proposal  ← Business analysis scratchpad
      ↓
    OpenSpec Design    ← Architecture + research scratchpads
      ↓
    OpenSpec Specs     ← (Written from design decisions)
      ↓
    OpenSpec Tasks     ← Decomposition + parallelization scratchpads
      ↓
    Analysis File      ← (Structured output from all research)
      ↓
    Implementation     ← Code explorer scratchpad (for pattern reference)
      ↓
    Verification       ← pnpm check (knip → biome → tsc → vitest → cdk synth)
The key insight is that research and planning are not overhead — they're how you avoid rework. Every hour spent in scratchpads saved multiple hours of "this data model doesn't work, let me redesign."
What This Looks Like in Practice
Here's a concrete example. The AI Conversation Lambda needs to:
- Load agent config (with caching)
- Load conversation history from DynamoDB
- Call Bedrock with the full message history
- Handle tool use loops
- Persist updated history with TTL
- Publish EventBridge events
Without CEK research, I'd discover the 8-second timeout constraint mid-implementation, realize the agent config lookup adds latency, add a cache, discover the cache is unbounded, and so on.
With CEK, the research scratchpad already documented the timeout constraint. The architecture scratchpad designed the module-scope cache. The decomposition scratchpad sized the task as "L" (4-8 hours) with explicit subtasks. By the time I started writing code, every decision was pre-made.
Why This Works with AI-Assisted Development
This workflow was designed for working with AI coding assistants. Here's why it fits:
Context is everything. AI assistants work best when they have full context — what the system should do, what patterns to follow, what constraints to respect. OpenSpec specs and CEK analysis files provide exactly that. Instead of "build me a Lambda," the prompt becomes "implement task C.2 per the spec, following the patterns in the code explorer scratchpad, respecting the constraints in the research scratchpad."
Small tasks prevent drift. AI assistants lose coherence on large, ambiguous tasks. Decomposing the work into 98 sized subtasks with explicit dependencies and acceptance criteria keeps each task scoped tightly enough to complete cleanly.
Scratchpads capture institutional knowledge. When you're iterating with an AI across multiple sessions, scratchpads persist the research. Session 3 doesn't need to re-discover that Connect has an 8-second timeout — it's in the scratchpad.
Specs catch regressions. When an AI modifies code, the spec defines what "correct" looks like. If a change breaks a spec scenario, you catch it in verification.
The Results
The telephony platform went from "fuzzy idea" to "passing pnpm check" with:
- 7 scratchpads of research (business analysis, technical research, code exploration, architecture, decomposition, parallelization, QA)
- 1 analysis file with a complete implementation manifest
- 6 spec files defining observable behavior
- 26 tasks with 98 subtasks across 6 phases
- 49 source files across 3 workspaces
- 254 passing tests (98 shared + 113 infra + 18 theme + 25 portfolio)
- Zero rework on the data model, stack architecture, or key design decisions
The specs are still there, still the source of truth. When I refactored the Lambda handlers to use shared utilities and key builders, the specs told me exactly what behavior to preserve.
Plan the work, then work the plan. It's not glamorous, but it works — especially when your pair programmer has a context window.