PAPER-2025-002

Harness Agent SDK Migration: Empirical Analysis

Security, Reliability, and Cost Improvements Through Explicit Tool Permissions

Case Study - 12 min read - Intermediate

Abstract

This case study examines the migration of the CREATE Something Harness from legacy headless mode patterns to Agent SDK best practices. We analyze the trade-offs between security, reliability, and operational efficiency, drawing from empirical observation of a live Canon Redesign project (21 features across 19 files). The migration replaces --dangerously-skip-permissions with explicit --allowedTools, adds runaway prevention via --max-turns, and enables cost tracking through structured JSON output parsing.

21/21

Features Complete

100

Max Turns Limit

Blocked Operations

~$0.50

Total Cost

1. Introduction

The CREATE Something Harness orchestrates autonomous Claude Code sessions for large-scale refactoring and feature implementation. Prior to this migration, the harness used --dangerously-skip-permissions for tool access—a pattern that prioritized convenience over security.

The Agent SDK documentation recommends explicit tool allowlists via --allowedTools. This migration implements that recommendation alongside additional optimizations.

1.1 Heideggerian Framing

Per the CREATE Something philosophy, infrastructure should exhibit Zuhandenheit (ready-to-hand): a tool in skilled use recedes from attention, like a hammer during carpentry. The harness should be invisible when working correctly; failures should surface clearly with actionable context.

1.2 The Canon Redesign Project

The test project: removing --webflow-blue (#4353ff) from the Webflow Dashboard. This brand color polluted focus states, buttons, links, nav, and logos—43 violations across 19 files.

Before	After	Semantic Purpose
`--webflow-blue` (focus)	`--color-border-emphasis`	Functional feedback
`--webflow-blue` (active)	`--color-active`	State indication
`--webflow-blue` (button)	`--color-fg-primary`	High contrast
`--webflow-blue` (link)	`--color-fg-secondary`	Receding hierarchy
`--webflow-blue` (logo)	`--color-fg-primary`	System branding

2. Architecture

2.1 Harness Flow

┌─────────────────────────────────────────────────────────┐
│                    HARNESS RUNNER                        │
│                                                          │
│  Spec Parser ──► Issue Creation ──► Session Loop         │
│                                                          │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Session 1 ──► Session 2 ──► Session 3 ──► ...  │    │
│  │      │             │             │               │    │
│  │      ▼             ▼             ▼               │    │
│  │  Checkpoint    Checkpoint    Checkpoint          │    │
│  │      │             │             │               │    │
│  │      ▼             ▼             ▼               │    │
│  │  Peer Review   Peer Review   Peer Review         │    │
│  └─────────────────────────────────────────────────┘    │
│                                                          │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│              BEADS (Human Interface)                     │
│                                                          │
│  bd progress - Review checkpoints                        │
│  bd update   - Redirect priorities                       │
│  bd create   - Inject work                               │
└─────────────────────────────────────────────────────────┘

2.2 Session Spawning

Each session spawns Claude Code in headless mode with explicit configuration:

// packages/harness/src/session.ts
export async function runSession(
  issueId: string,
  prompt: string,
  options: SessionOptions = {}
): Promise<SessionResult> {
  const args = [
    '-p',
    '--allowedTools', HARNESS_ALLOWED_TOOLS,
    '--max-turns', options.maxTurns?.toString() ?? '100',
    '--output-format', 'json',
  ];

  if (options.model) {
    args.push('--model', options.model);
  }

  // Spawn claude process with captured stdout/stderr
  const result = await spawnClaude(args, prompt);

  // Parse structured JSON output
  const metrics = parseJsonOutput(result.stdout);

  return {
    issueId,
    outcome: determineOutcome(result),
    sessionId: metrics.sessionId,
    costUsd: metrics.costUsd,
    numTurns: metrics.numTurns,
  };
}
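
The parseJsonOutput helper is referenced above but not shown. What follows is a minimal sketch of what it might look like, not the harness's actual implementation. It assumes the CLI's JSON result object carries session_id, total_cost_usd, and num_turns keys; those key names and the SessionMetrics type are assumptions to verify against the output of your CLI version.

```typescript
// Hypothetical shape for the metrics the harness keeps per session.
interface SessionMetrics {
  sessionId: string | null;
  costUsd: number | null;
  numTurns: number | null;
}

// Assumed snake_case keys in the CLI's JSON result; adjust if yours differ.
function parseJsonOutput(stdout: string): SessionMetrics {
  const empty: SessionMetrics = { sessionId: null, costUsd: null, numTurns: null };
  let raw: unknown;
  try {
    raw = JSON.parse(stdout.trim());
  } catch {
    // Non-JSON output (for example a crash message): report no metrics rather than throw.
    return empty;
  }
  if (typeof raw !== 'object' || raw === null) return empty;
  const obj = raw as Record<string, unknown>;
  const sessionId = obj['session_id'];
  const costUsd = obj['total_cost_usd'];
  const numTurns = obj['num_turns'];
  return {
    sessionId: typeof sessionId === 'string' ? sessionId : null,
    costUsd: typeof costUsd === 'number' ? costUsd : null,
    numTurns: typeof numTurns === 'number' ? numTurns : null,
  };
}
```

Returning nulls instead of throwing keeps a malformed result from masking the session's real outcome; the caller can still decide that missing metrics count as a failure.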

3. Migration Changes

3.1 Before: Legacy Pattern

const args = [
  '-p',
  '--dangerously-skip-permissions',
  '--output-format', 'json',
];

Characteristics:

All tools available without restriction
No runaway prevention
No cost tracking
No model selection
Security relies entirely on session isolation

3.2 After: Agent SDK Pattern

const args = [
  '-p',
  '--allowedTools', HARNESS_ALLOWED_TOOLS,
  '--max-turns', '100',
  '--output-format', 'json',
  '--model', options.model,  // appended only when a model is specified (see 2.2)
];

Characteristics:

Explicit tool allowlist (defense in depth)
Turn limit prevents infinite loops
JSON output enables metrics parsing
Model selection for cost optimization
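
The difference between the two patterns can be captured in a small argument builder. This is a sketch under assumptions: buildSessionArgs and its SessionOptions shape are hypothetical names, and the only flags used are the ones already shown in this section.

```typescript
// Hypothetical options type mirroring the fields used in section 2.2.
interface SessionOptions {
  maxTurns?: number;
  model?: string;
}

// Builds the CLI argument list; `allowedTools` is the comma-joined allowlist string.
function buildSessionArgs(allowedTools: string, options: SessionOptions = {}): string[] {
  const args = [
    '-p',
    '--allowedTools', allowedTools,
    '--max-turns', String(options.maxTurns ?? 100),
    '--output-format', 'json',
  ];
  // Only pass --model when one was chosen, so the CLI default applies otherwise.
  if (options.model !== undefined) {
    args.push('--model', options.model);
  }
  return args;
}
```

Calling buildSessionArgs('Read,Grep', { model: 'haiku' }) yields the allowlist, the default 100-turn limit, JSON output, and the model flag, while omitting the model leaves the flag out entirely.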

3.3 Tool Categories

Category	Tools	Purpose
Core	Read, Write, Edit, Glob, Grep, NotebookEdit	File operations
Bash Patterns	git:*, pnpm:*, npm:*, wrangler:*, bd:*, bv:*	Scoped shell access
Orchestration	Task, TodoWrite, WebFetch, WebSearch	Agent coordination
CREATE Something	Skill	Canon, deploy, audit skills
Infrastructure	mcp__cloudflare__* (14 tools)	KV, D1, R2, Workers
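
One way to keep these categories maintainable is to store them as data and derive the allowlist string from them. The sketch below is illustrative only: TOOL_CATEGORIES and buildAllowlist are hypothetical names, and the category contents are abbreviated from the table above (the MCP tools are left out for brevity).

```typescript
// Category names follow the table above; tool lists are abbreviated for illustration.
const TOOL_CATEGORIES: Record<string, string[]> = {
  core: ['Read', 'Write', 'Edit', 'Glob', 'Grep', 'NotebookEdit'],
  bash: ['git', 'pnpm', 'npm', 'wrangler', 'bd', 'bv'].map((cmd) => `Bash(${cmd}:*)`),
  orchestration: ['Task', 'TodoWrite', 'WebFetch', 'WebSearch'],
  createSomething: ['Skill'],
};

// Flattens the selected categories into the comma-separated string passed to --allowedTools.
function buildAllowlist(categories: string[]): string {
  const tools = new Set<string>();
  for (const name of categories) {
    const entries = TOOL_CATEGORIES[name];
    if (entries === undefined) {
      // Fail loudly on typos instead of silently granting fewer tools than intended.
      throw new Error(`Unknown tool category: ${name}`);
    }
    for (const tool of entries) tools.add(tool);
  }
  return Array.from(tools).join(',');
}
```

A review-only session could then request buildAllowlist(['core']) while a full implementation session requests every category, without duplicating tool names across call sites.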

4. Peer Review Pipeline

The harness runs three peer reviewers at checkpoint boundaries:

const REVIEWERS: ReviewerConfig[] = [
  {
    name: 'security',
    prompt: 'Review the code changes for security vulnerabilities...',
    model: 'haiku',
    timeout: 30000,
  },
  {
    name: 'architecture',
    prompt: 'Review the code changes for architectural concerns...',
    model: 'haiku',
    timeout: 30000,
  },
  {
    name: 'quality',
    prompt: 'Review the code changes for quality issues...',
    model: 'haiku',
    timeout: 30000,
  },
];

4.1 Observed Review Outcomes

Reviewer	Pass	Pass w/Findings	Fail
Security	100%	0%	0%
Architecture	40%	60%	0%
Quality	100%	0%	0%

Finding: Architecture reviewer surfaces legitimate concerns (token consistency, pattern adherence) without blocking progress. This matches the intended "first-pass analysis" philosophy.
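
The non-blocking behavior described above could be expressed as a small aggregation step over the three reviewer results. This is a sketch, not the harness's code: the ReviewResult and CheckpointDecision types are hypothetical, and it assumes that only a Fail outcome blocks a checkpoint while findings are collected for human review.

```typescript
type ReviewOutcome = 'pass' | 'pass_with_findings' | 'fail';

// Hypothetical result shape returned by each reviewer.
interface ReviewResult {
  reviewer: string;
  outcome: ReviewOutcome;
  findings: string[];
}

interface CheckpointDecision {
  blocked: boolean;
  findings: string[];
}

// Assumption: only an outright 'fail' blocks progress; findings are surfaced but non-blocking.
function decideCheckpoint(results: ReviewResult[]): CheckpointDecision {
  const findings: string[] = [];
  let blocked = false;
  for (const result of results) {
    if (result.outcome === 'fail') blocked = true;
    for (const finding of result.findings) {
      findings.push(`[${result.reviewer}] ${finding}`);
    }
  }
  return { blocked, findings };
}
```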

5. Empirical Observations

5.1 Security Improvements

Scenario	Before	After
Arbitrary Bash	Allowed	Blocked unless pattern-matched
File deletion	Unrestricted	Bash(rm:*) required
Network access	Unrestricted	WebFetch/WebSearch, plus network-capable allowlisted CLIs (npm, wrangler)
MCP tools	All available	Explicit allowlist

Finding: No legitimate harness operations were blocked by the new restrictions. The allowlist is sufficient for all observed work patterns.
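
To make "blocked unless pattern-matched" concrete, the sketch below models a prefix rule such as Bash(git:*) as "the command must start with this prefix". It is a deliberately simplified mental model, not the CLI's actual matcher, which may treat chained commands, quoting, and other edge cases differently. The function names bashPrefix and isBashAllowed are hypothetical.

```typescript
// Extracts the command prefix from a rule such as "Bash(git:*)"; returns null for other rules.
function bashPrefix(rule: string): string | null {
  const match = /^Bash\((.+):\*\)$/.exec(rule);
  return match ? (match[1] ?? null) : null;
}

// Simplified model: a command is allowed when it equals a prefix or starts with "<prefix> ".
function isBashAllowed(command: string, rules: string[]): boolean {
  const trimmed = command.trim();
  // Refuse anything that chains or substitutes commands; a prefix check cannot vet those.
  if (/[;&|`]|\$\(/.test(trimmed)) return false;
  return rules.some((rule) => {
    const prefix = bashPrefix(rule);
    return prefix !== null && (trimmed === prefix || trimmed.startsWith(prefix + ' '));
  });
}
```

Under this model, with rules ['Bash(git:*)', 'Bash(pnpm:*)'], the command git status is allowed, while curl and any command containing && are rejected.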

5.2 Runaway Prevention

--max-turns 100 prevents infinite loops. Observed session turn counts:

Task Type	Typical Turns	Max Observed
Simple CSS fix	8-15	22
Component refactor	15-30	45
Multi-file update	25-50	72

5.3 Cost Visibility

Phase	Description	Est. Cost
Phase 21	Verification	~$0.01
Phase 20	GsapValidationModal	~$0.02
Phase 19	SubmissionTracker	~$0.02
Phase 18	ApiKeysManager	~$0.03
...	...	...

5.4 Model Selection Impact

Model	Use Case	Cost Ratio	Quality
Opus	Complex architectural changes	1x (baseline)	Highest
Sonnet	Standard implementation	~0.2x	High
Haiku	Simple CSS fixes, reviews	~0.05x	Sufficient
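
The ratios in this table translate directly into the savings figures quoted elsewhere in the paper. The arithmetic below is a worked sketch using only the approximate ratios above; the helper names are hypothetical.

```typescript
// Approximate cost ratios from the table above, relative to Opus.
const COST_RATIO = { opus: 1, sonnet: 0.2, haiku: 0.05 } as const;
type Model = keyof typeof COST_RATIO;

// Percentage saved by running the same work on `model` instead of `baseline`.
function savingsPercent(model: Model, baseline: Model = 'opus'): number {
  return Math.round((1 - COST_RATIO[model] / COST_RATIO[baseline]) * 100);
}

// How many times cheaper `model` is than `baseline`.
function timesCheaper(model: Model, baseline: Model = 'opus'): number {
  return Math.round(COST_RATIO[baseline] / COST_RATIO[model]);
}

console.log(savingsPercent('haiku'));          // 95 (percent saved vs. Opus)
console.log(timesCheaper('sonnet'));           // 5  (Sonnet vs. Opus)
console.log(timesCheaper('haiku'));            // 20 (Haiku vs. Opus)
console.log(timesCheaper('haiku', 'sonnet'));  // 4  (Haiku vs. Sonnet)
```

Because the ratios are approximate, these figures are order-of-magnitude guides rather than billing predictions.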

6. Trade-offs Analysis

6.1 Pros

Benefit	Impact	Evidence
Explicit Security	High	No unauthorized tool access possible
Runaway Prevention	Medium	100-turn limit prevents infinite loops
Cost Visibility	Medium	Per-session cost tracking enabled
Model Selection	Medium	Up to ~20x cost reduction with Haiku relative to Opus
CREATE Something Integration	High	Skill, Beads, Cloudflare MCP included

6.2 Cons

Drawback	Impact	Mitigation
Allowlist Maintenance	Low	Stable tool set; rare updates needed
Bash Pattern Complexity	Medium	Document patterns; provide examples
New Tool Discovery Friction	Low	Add to allowlist when needed

7. Recommendations

7.1 Immediate Adoption

Replace --dangerously-skip-permissions with --allowedTools: The security improvement had no observed operational cost in this project (section 5.1).
Set --max-turns 100: Provides headroom above the longest observed session (72 turns) without enabling runaways.
Parse JSON output for metrics: Even if not displayed, capture for future analysis.
Use Haiku for peer reviews: Roughly 95% cheaper than Opus (section 5.4), with quality sufficient for first-pass review.

7.2 Future Work

Implement --resume: Use captured session_id for task continuity within epics.
Model auto-selection: Use task complexity to choose Haiku/Sonnet/Opus.
Cost budgets: Set per-harness-run cost limits with automatic pause.
Streaming output: Use --output-format stream-json for real-time progress.
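
As an illustration of the cost-budget idea, the sketch below tracks cumulative session cost against a per-run limit. It describes future work, so nothing here exists in the harness today: the CostBudget class and its method names are hypothetical, and the per-session cost is assumed to come from the parsed JSON metrics.

```typescript
// Hypothetical per-run budget tracker; amounts are in USD.
class CostBudget {
  private spentUsd = 0;
  private readonly limitUsd: number;

  constructor(limitUsd: number) {
    this.limitUsd = limitUsd;
  }

  // Records a finished session's cost and reports whether the run may continue.
  record(sessionCostUsd: number): boolean {
    this.spentUsd += sessionCostUsd;
    return !this.exhausted();
  }

  // True once cumulative spend has reached the limit.
  exhausted(): boolean {
    return this.spentUsd >= this.limitUsd;
  }

  remainingUsd(): number {
    return Math.max(0, this.limitUsd - this.spentUsd);
  }
}
```

The session loop would call record() after each session and pause for human review when it returns false. Since a session's cost is only known after it finishes, the limit can be overshot by at most one session.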

8. How to Apply This

Migrating Your Own Harness or Autonomous Agent

To apply this migration pattern to your autonomous Claude Code orchestration:

Step 1: Audit Current Tool Access (Human)
List all tools your harness currently uses. Check headless session logs or code
that spawns Claude. If it passes --dangerously-skip-permissions, every tool is
available to the session without restriction.

Step 2: Categorize Essential Tools (Human)
Group tools by purpose:
- Core file operations (Read, Write, Edit, Glob, Grep)
- Version control (Bash(git:*))
- Package management (Bash(pnpm:*), Bash(npm:*))
- Build/deploy (Bash(wrangler:*), domain-specific commands)
- MCP integrations (mcp__cloudflare__*, mcp__airtable__*, etc.)

Step 3: Create Explicit Allowlist (Human)
Build your ALLOWED_TOOLS string. Start conservative—add only what you've verified
is needed. You can expand later if sessions fail.

Step 4: Add Runaway Prevention (Human + Agent)
Set --max-turns based on observed session lengths. Use 2-3x your average as a
safety margin. If most sessions complete in 30-50 turns, set --max-turns 100.

Step 5: Enable Cost Tracking (Agent)
Add --output-format json to capture session metadata. Parse stdout to extract
the session ID, cost, and turn count (sessionId, costUsd, numTurns in the harness).
Store these for analysis.

Step 6: Test on Non-Critical Work First (Human)
Run the migrated harness on low-stakes tasks. Verify tools aren't blocked.
Check that turn limits don't prevent legitimate completion.
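
The rule of thumb in Step 4 can be sketched as a small helper. It is one possible reading of "2-3x your average", with an added floor at the longest observed session; the function name and the 2.5 default multiplier are assumptions.

```typescript
// Suggests a --max-turns value as a multiple of the average observed turn count,
// never below the longest session actually seen.
function suggestMaxTurns(observedTurns: number[], multiplier = 2.5): number {
  if (observedTurns.length === 0) {
    throw new Error('Need at least one observed session');
  }
  const total = observedTurns.reduce((sum, n) => sum + n, 0);
  const average = total / observedTurns.length;
  const longest = Math.max(...observedTurns);
  return Math.ceil(Math.max(average * multiplier, longest));
}

console.log(suggestMaxTurns([30, 40, 50])); // 100
```

With sessions of 30, 40, and 50 turns the average is 40, so the default multiplier gives the 100-turn limit used in this study.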

Real-World Example: Migrating a Deployment Harness

Let's say you have a harness that autonomously deploys Cloudflare Workers. Before migration:

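// Before: Unsafe pattern
import { spawnSync } from 'node:child_process';

const args = [
  '-p',
  '--dangerously-skip-permissions',
];

// spawnSync writes `input` to stdin and returns the captured stdout as a string
const result = spawnSync('claude', args, { input: deployPrompt, encoding: 'utf8' });
// No cost tracking, no runaway prevention, unrestricted tool access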

After analyzing actual usage, you discover the harness needs:

File operations to read wrangler.toml and Worker scripts
Git to check status and create deployment tags
Wrangler to deploy and check deployment status
Cloudflare MCP to update KV/D1 data if needed

After migration:

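// After: Agent SDK best practices
import { spawnSync } from 'node:child_process';

const DEPLOY_ALLOWED_TOOLS = [
  // Core file operations
  'Read', 'Write', 'Edit', 'Glob', 'Grep',

  // Version control (scoped to specific subcommands; prefix rules use the
  // Bash(<command prefix>:*) form, as in Appendix A)
  'Bash(git status:*)', 'Bash(git tag:*)', 'Bash(git log:*)',

  // Deployment (scoped)
  'Bash(wrangler deploy:*)', 'Bash(wrangler tail:*)', 'Bash(wrangler whoami:*)',

  // Cloudflare MCP (explicit)
  'mcp__cloudflare__worker_deploy',
  'mcp__cloudflare__kv_put',
  'mcp__cloudflare__d1_query',
].join(',');

const args = [
  '-p',
  '--allowedTools', DEPLOY_ALLOWED_TOOLS,
  '--max-turns', '50',  // Deployments are fast; low limit appropriate
  '--output-format', 'json',
  '--model', 'sonnet',  // Sonnet sufficient for deployments
];

// spawnSync writes `input` to stdin and returns the captured stdout as a string
const result = spawnSync('claude', args, { input: deployPrompt, encoding: 'utf8' });

// Parse metrics. The CLI's JSON result is assumed to use snake_case keys; map them
// to the camelCase names used by the harness (verify keys against your CLI version).
const raw = JSON.parse(result.stdout);
const metrics = {
  sessionId: raw.session_id,
  costUsd: raw.total_cost_usd,
  numTurns: raw.num_turns,
};
console.log(`Deployment cost: $${metrics.costUsd}`);
console.log(`Turns used: ${metrics.numTurns}/50`);

// Store for analysis (`db` stands in for your own persistence layer)
await db.deployments.create({
  sessionId: metrics.sessionId,
  costUsd: metrics.costUsd,
  numTurns: metrics.numTurns,
  timestamp: new Date(),
});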

Notice:

Scoped Bash patterns: git status is allowed, git reset --hard is blocked
Lower turn limit: Deployments complete in 10-20 turns; 50 provides headroom
Model selection: Sonnet is roughly 5x cheaper than Opus (section 5.4) and sufficient for standard deploys
Metrics capture: JSON output enables cost analysis over time

When to Expand Tool Access

Add tools to the allowlist when:

Sessions fail with "permission denied": Check logs, identify blocked tool, evaluate if it should be allowed
New workflow requirements: Adding database migrations? Add mcp__cloudflare__d1_query
Peer review identifies missing capability: Architecture reviewer notes the harness can't perform needed operation

Don't add tools when:

The request is "just in case"—only add verified needs
A safer alternative exists (prefer WebFetch over Bash(curl:*))
The operation should require human approval (don't automate destructive operations)

Validating the Migration

After migration, validate success by:

✓ Zero sessions blocked by missing tool permissions
✓ All sessions complete within turn limits (or fail for legitimate reasons)
✓ Cost tracking data populates correctly
✓ Model selection matches task complexity (Haiku for simple, Opus for complex)
✓ Peer reviews run and surface appropriate findings
✓ No degradation in harness capabilities compared to legacy approach

The goal is explicit security without operational cost. If the migration blocks legitimate work or significantly slows execution, the allowlist is too restrictive. If it allows operations that shouldn't be automated, it's too permissive. Iterate until the harness operates transparently—Zuhandenheit achieved.
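
The first three checklist items lend themselves to an automated check over stored session records. The sketch below assumes a hypothetical SessionRecord shape (including a blockedTools list that the harness would have to populate from its own logs). The remaining items, such as model fit and review quality, still need human judgment.

```typescript
// Hypothetical record of one finished session, as stored by the harness.
interface SessionRecord {
  numTurns: number;
  maxTurns: number;
  costUsd: number | null;
  blockedTools: string[]; // tools the session requested but was denied
}

interface ValidationReport {
  blockedSessions: number;
  sessionsAtTurnLimit: number;
  sessionsMissingCost: number;
  passed: boolean;
}

// Counts sessions that violate each automatable checklist item.
function validateMigration(records: SessionRecord[]): ValidationReport {
  const blockedSessions = records.filter((r) => r.blockedTools.length > 0).length;
  const sessionsAtTurnLimit = records.filter((r) => r.numTurns >= r.maxTurns).length;
  const sessionsMissingCost = records.filter((r) => r.costUsd === null).length;
  return {
    blockedSessions,
    sessionsAtTurnLimit,
    sessionsMissingCost,
    passed: blockedSessions === 0 && sessionsAtTurnLimit === 0 && sessionsMissingCost === 0,
  };
}
```

A session that hits the turn limit is flagged rather than judged: someone still has to decide whether it failed for a legitimate reason or whether the limit is too low.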

9. Conclusion

The Agent SDK migration improves the CREATE Something Harness without degrading operational capability. The explicit tool allowlist provides defense-in-depth security, while --max-turns prevents runaway sessions.

The key insight: restrictive defaults with explicit exceptions are more maintainable than permissive defaults with implicit risks.

This aligns with the Subtractive Triad:

DRY: One allowlist, not per-session permission decisions
Rams: Only necessary tools; each earns its place
Heidegger: Infrastructure recedes; security becomes invisible when correct

Appendix A: Full Tool Allowlist

const HARNESS_ALLOWED_TOOLS = [
  // Core file operations
  'Read', 'Write', 'Edit', 'Glob', 'Grep', 'NotebookEdit',

  // Bash with granular patterns
  'Bash(git:*)', 'Bash(pnpm:*)', 'Bash(npm:*)', 'Bash(npx:*)',
  'Bash(node:*)', 'Bash(tsc:*)', 'Bash(wrangler:*)',
  'Bash(bd:*)', 'Bash(bv:*)',  // Beads CLI
  'Bash(grep:*)', 'Bash(find:*)', 'Bash(ls:*)', 'Bash(cat:*)',
  'Bash(mkdir:*)', 'Bash(rm:*)', 'Bash(cp:*)', 'Bash(mv:*)',
  'Bash(echo:*)', 'Bash(test:*)',

  // Orchestration
  'Task', 'TodoWrite', 'WebFetch', 'WebSearch',

  // CREATE Something
  'Skill',

  // MCP Cloudflare
  'mcp__cloudflare__kv_get', 'mcp__cloudflare__kv_put',
  'mcp__cloudflare__kv_list', 'mcp__cloudflare__d1_query',
  'mcp__cloudflare__d1_list_databases',
  'mcp__cloudflare__r2_list_objects', 'mcp__cloudflare__r2_get_object',
  'mcp__cloudflare__r2_put_object', 'mcp__cloudflare__worker_list',
  'mcp__cloudflare__worker_get', 'mcp__cloudflare__worker_deploy',
].join(',');