PAPER-2025-002

Harness Agent SDK Migration: Empirical Analysis

Security, Reliability, and Cost Improvements Through Explicit Tool Permissions

Case Study - 12 min read - Intermediate

Abstract

This case study examines the migration of the CREATE Something Harness from legacy headless mode patterns to Agent SDK best practices. We analyze the trade-offs between security, reliability, and operational efficiency, drawing from empirical observation of a live Canon Redesign project (21 features across 19 files). The migration replaces --dangerously-skip-permissions with explicit --allowedTools, adds runaway prevention via --max-turns, and enables cost tracking through structured JSON output parsing.

Features Complete: 21/21
Max Turns Limit: 100
Blocked Operations: 0
Total Cost: ~$0.50

1. Introduction

The CREATE Something Harness orchestrates autonomous Claude Code sessions for large-scale refactoring and feature implementation. Prior to this migration, the harness used --dangerously-skip-permissions for tool access—a pattern that prioritized convenience over security.

The Agent SDK documentation recommends explicit tool allowlists via --allowedTools. This migration implements that recommendation alongside additional optimizations.

1.1 Heideggerian Framing

Per the CREATE Something philosophy, infrastructure should exhibit Zuhandenheit (readiness-to-hand): a tool disappears into transparent use, like a hammer during skilled carpentry. The harness should be invisible when working correctly; failures should surface clearly with actionable context.

1.2 The Canon Redesign Project

The test project: removing --webflow-blue (#4353ff) from the Webflow Dashboard. This brand color polluted focus states, buttons, links, nav, and logos—43 violations across 19 files.

| Before | After | Semantic Purpose |
|---|---|---|
| --webflow-blue (focus) | --color-border-emphasis | Functional feedback |
| --webflow-blue (active) | --color-active | State indication |
| --webflow-blue (button) | --color-fg-primary | High contrast |
| --webflow-blue (link) | --color-fg-secondary | Receding hierarchy |
| --webflow-blue (logo) | --color-fg-primary | System branding |

2. Architecture

2.1 Harness Flow

┌─────────────────────────────────────────────────────────┐
│                    HARNESS RUNNER                        │
│                                                          │
│  Spec Parser ──► Issue Creation ──► Session Loop         │
│                                                          │
│  ┌─────────────────────────────────────────────────┐    │
│  │  Session 1 ──► Session 2 ──► Session 3 ──► ...  │    │
│  │      │             │             │               │    │
│  │      ▼             ▼             ▼               │    │
│  │  Checkpoint    Checkpoint    Checkpoint          │    │
│  │      │             │             │               │    │
│  │      ▼             ▼             ▼               │    │
│  │  Peer Review   Peer Review   Peer Review         │    │
│  └─────────────────────────────────────────────────┘    │
│                                                          │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────┐
│              BEADS (Human Interface)                     │
│                                                          │
│  bd progress - Review checkpoints                        │
│  bd update   - Redirect priorities                       │
│  bd create   - Inject work                               │
└─────────────────────────────────────────────────────────┘

2.2 Session Spawning

Each session spawns Claude Code in headless mode with explicit configuration:

// packages/harness/src/session.ts
export async function runSession(
  issueId: string,
  prompt: string,
  options: SessionOptions = {}
): Promise<SessionResult> {
  const args = [
    '-p',
    '--allowedTools', HARNESS_ALLOWED_TOOLS,
    '--max-turns', options.maxTurns?.toString() ?? '100',
    '--output-format', 'json',
  ];

  if (options.model) {
    args.push('--model', options.model);
  }

  // Spawn claude process with captured stdout/stderr
  const result = await spawnClaude(args, prompt);

  // Parse structured JSON output
  const metrics = parseJsonOutput(result.stdout);

  return {
    issueId,
    outcome: determineOutcome(result),
    sessionId: metrics.sessionId,
    costUsd: metrics.costUsd,
    numTurns: metrics.numTurns,
  };
}
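
The parseJsonOutput helper is not shown above. A minimal sketch of what it might do follows, assuming the CLI's JSON result carries session_id, total_cost_usd, and num_turns fields. Those field names are an assumption to verify against your installed Claude Code version, and the SessionMetrics shape is hypothetical.

```typescript
// Hypothetical shape of the metrics the harness keeps per session.
interface SessionMetrics {
  sessionId: string | null;
  costUsd: number | null;
  numTurns: number | null;
}

// Assumed raw field names: session_id, total_cost_usd, num_turns.
// Missing or malformed fields become null instead of throwing.
function parseJsonOutput(stdout: string): SessionMetrics {
  const empty: SessionMetrics = { sessionId: null, costUsd: null, numTurns: null };
  let raw: unknown;
  try {
    raw = JSON.parse(stdout);
  } catch {
    return empty; // Non-JSON output (for example a crash message): no metrics.
  }
  if (typeof raw !== 'object' || raw === null) return empty;
  const r = raw as Record<string, unknown>;
  const sessionId = r['session_id'];
  const costUsd = r['total_cost_usd'];
  const numTurns = r['num_turns'];
  return {
    sessionId: typeof sessionId === 'string' ? sessionId : null,
    costUsd: typeof costUsd === 'number' ? costUsd : null,
    numTurns: typeof numTurns === 'number' ? numTurns : null,
  };
}
```

Returning nulls rather than throwing lets a session with unparseable output still be recorded as a failure with no metrics attached.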

3. Migration Changes

3.1 Before: Legacy Pattern

const args = [
  '-p',
  '--dangerously-skip-permissions',
  '--output-format', 'json',
];

Characteristics:

  • All tools available without restriction
  • No runaway prevention
  • No cost tracking
  • No model selection
  • Security relies entirely on session isolation

3.2 After: Agent SDK Pattern

const args = [
  '-p',
  '--allowedTools', HARNESS_ALLOWED_TOOLS,
  '--max-turns', '100',
  '--output-format', 'json',
  '--model', options.model,
];

Characteristics:

  • Explicit tool allowlist (defense in depth)
  • Turn limit prevents infinite loops
  • JSON output enables metrics parsing
  • Model selection for cost optimization
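
The snippet above passes options.model unconditionally, while Section 2.2 only adds --model when one is set. A small sketch of an argument builder that follows the Section 2.2 behavior is shown here; buildSessionArgs and this SessionOptions shape are hypothetical names introduced for illustration.

```typescript
// Hypothetical options type; the harness's real SessionOptions may differ.
interface SessionOptions {
  maxTurns?: number;
  model?: string;
}

// Builds the CLI argument list using only flags named in this paper.
function buildSessionArgs(allowedTools: string, options: SessionOptions = {}): string[] {
  const args = [
    '-p',
    '--allowedTools', allowedTools,
    '--max-turns', String(options.maxTurns ?? 100),
    '--output-format', 'json',
  ];
  // Only pass --model when the caller chose one, so the CLI default applies otherwise.
  if (options.model) {
    args.push('--model', options.model);
  }
  return args;
}
```

Keeping the builder as a pure function makes the flag set easy to unit test without spawning a process.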

3.3 Tool Categories

| Category | Tools | Purpose |
|---|---|---|
| Core | Read, Write, Edit, Glob, Grep, NotebookEdit | File operations |
| Bash Patterns | git:*, pnpm:*, npm:*, wrangler:*, bd:*, bv:* | Scoped shell access |
| Orchestration | Task, TodoWrite, WebFetch, WebSearch | Agent coordination |
| CREATE Something | Skill | Canon, deploy, audit skills |
| Infrastructure | mcp__cloudflare__* (14 tools) | KV, D1, R2, Workers |
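
One way to keep these categories maintainable is to compose the allowlist string from named groups. The sketch below is illustrative only: TOOL_CATEGORIES and buildAllowlist are hypothetical names, and the MCP entries are left out for brevity.

```typescript
// Tool groups taken from the table above (MCP tools omitted for brevity).
const TOOL_CATEGORIES: Record<string, string[]> = {
  core: ['Read', 'Write', 'Edit', 'Glob', 'Grep', 'NotebookEdit'],
  bashPatterns: ['git', 'pnpm', 'npm', 'wrangler', 'bd', 'bv'].map((cmd) => `Bash(${cmd}:*)`),
  orchestration: ['Task', 'TodoWrite', 'WebFetch', 'WebSearch'],
  createSomething: ['Skill'],
};

// Flattens the groups into one comma-separated string, skipping duplicates.
function buildAllowlist(categories: Record<string, string[]>): string {
  const tools: string[] = [];
  for (const name of Object.keys(categories)) {
    for (const tool of categories[name] ?? []) {
      if (tools.indexOf(tool) === -1) tools.push(tool);
    }
  }
  return tools.join(',');
}

const COMPOSED_ALLOWLIST = buildAllowlist(TOOL_CATEGORIES);
```

Grouping by category keeps each addition reviewable: a new tool has to be placed under a stated purpose before it reaches the CLI.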

4. Peer Review Pipeline

The harness runs three peer reviewers at checkpoint boundaries:

const REVIEWERS: ReviewerConfig[] = [
  {
    name: 'security',
    prompt: 'Review the code changes for security vulnerabilities...',
    model: 'haiku',
    timeout: 30000,
  },
  {
    name: 'architecture',
    prompt: 'Review the code changes for architectural concerns...',
    model: 'haiku',
    timeout: 30000,
  },
  {
    name: 'quality',
    prompt: 'Review the code changes for quality issues...',
    model: 'haiku',
    timeout: 30000,
  },
];

4.1 Observed Review Outcomes

| Reviewer | Pass | Pass w/Findings | Fail |
|---|---|---|---|
| Security | 100% | 0% | 0% |
| Architecture | 40% | 60% | 0% |
| Quality | 100% | 0% | 0% |

Finding: Architecture reviewer surfaces legitimate concerns (token consistency, pattern adherence) without blocking progress. This matches the intended "first-pass analysis" philosophy.
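
The "findings without blocking" behavior can be sketched as a small aggregation step in which only a Fail outcome stops the checkpoint. The source does not show how review outcomes are represented, so ReviewResult, CheckpointVerdict, and summarizeReviews are hypothetical names.

```typescript
type ReviewOutcome = 'pass' | 'pass_with_findings' | 'fail';

// Hypothetical record of one reviewer's result at a checkpoint.
interface ReviewResult {
  reviewer: string;
  outcome: ReviewOutcome;
  findings: string[];
}

interface CheckpointVerdict {
  blocked: boolean;
  findings: string[];
}

// Only an explicit 'fail' blocks progress; findings are collected for human review.
function summarizeReviews(results: ReviewResult[]): CheckpointVerdict {
  const blocked = results.some((r) => r.outcome === 'fail');
  const findings: string[] = [];
  for (const r of results) {
    for (const finding of r.findings) {
      findings.push(`${r.reviewer}: ${finding}`);
    }
  }
  return { blocked, findings };
}
```

Under this scheme, the architecture reviewer's "pass with findings" results would appear in the checkpoint summary without pausing the session loop.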

5. Empirical Observations

5.1 Security Improvements

| Scenario | Before | After |
|---|---|---|
| Arbitrary Bash | Allowed | Blocked unless pattern-matched |
| File deletion | Unrestricted | Bash(rm:*) required |
| Network access | Unrestricted | WebFetch/WebSearch only |
| MCP tools | All available | Explicit allowlist |

Finding: No legitimate harness operations were blocked by the new restrictions. The allowlist is sufficient for all observed work patterns.

5.2 Runaway Prevention

--max-turns 100 prevents infinite loops. Observed session turn counts:

| Task Type | Avg Turns | Max Observed |
|---|---|---|
| Simple CSS fix | 8-15 | 22 |
| Component refactor | 15-30 | 45 |
| Multi-file update | 25-50 | 72 |

5.3 Cost Visibility

| Phase | Description | Est. Cost |
|---|---|---|
| Phase 21 | Verification | ~$0.01 |
| Phase 20 | GsapValidationModal | ~$0.02 |
| Phase 19 | SubmissionTracker | ~$0.02 |
| Phase 18 | ApiKeysManager | ~$0.03 |
| ... | ... | ... |
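
Once per-session costs are captured, rolling them up is straightforward. The sketch below totals the four phases listed above; PhaseCost and summarizeCosts are hypothetical names, and the amounts are the table's estimates rather than exact billing figures.

```typescript
// Hypothetical per-phase cost record.
interface PhaseCost {
  phase: string;
  costUsd: number;
}

// Totals costs (rounded to whole cents) and reports the most expensive phase.
function summarizeCosts(records: PhaseCost[]): { totalUsd: number; mostExpensive: string | null } {
  let total = 0;
  let mostExpensive: PhaseCost | null = null;
  for (const record of records) {
    total += record.costUsd;
    if (mostExpensive === null || record.costUsd > mostExpensive.costUsd) {
      mostExpensive = record;
    }
  }
  return {
    totalUsd: Math.round(total * 100) / 100, // avoid floating-point noise in reports
    mostExpensive: mostExpensive === null ? null : mostExpensive.phase,
  };
}

// The four phases listed in the table (estimates).
const listedPhases: PhaseCost[] = [
  { phase: 'Phase 21', costUsd: 0.01 },
  { phase: 'Phase 20', costUsd: 0.02 },
  { phase: 'Phase 19', costUsd: 0.02 },
  { phase: 'Phase 18', costUsd: 0.03 },
];
const listedSummary = summarizeCosts(listedPhases);
```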

5.4 Model Selection Impact

| Model | Use Case | Cost Ratio | Quality |
|---|---|---|---|
| Opus | Complex architectural changes | 1x (baseline) | Highest |
| Sonnet | Standard implementation | ~0.2x | High |
| Haiku | Simple CSS fixes, reviews | ~0.05x | Sufficient |

6. Trade-offs Analysis

6.1 Pros

| Benefit | Impact | Evidence |
|---|---|---|
| Explicit Security | High | No unauthorized tool access possible |
| Runaway Prevention | Medium | 100-turn limit prevents infinite loops |
| Cost Visibility | Medium | Per-session cost tracking enabled |
| Model Selection | Medium | 10-20x cost reduction with Haiku |
| CREATE Something Integration | High | Skill, Beads, Cloudflare MCP included |

6.2 Cons

| Drawback | Impact | Mitigation |
|---|---|---|
| Allowlist Maintenance | Low | Stable tool set; rare updates needed |
| Bash Pattern Complexity | Medium | Document patterns; provide examples |
| New Tool Discovery Friction | Low | Add to allowlist when needed |

7. Recommendations

7.1 Immediate Adoption

  1. Replace --dangerously-skip-permissions with --allowedTools: The security improvement had no observed operational cost in this project (Section 5.1).
  2. Set --max-turns 100: Provides headroom without enabling runaways.
  3. Parse JSON output for metrics: Even if not displayed, capture for future analysis.
  4. Use Haiku for peer reviews: roughly 95% cost reduction relative to Opus (Section 5.4), with quality sufficient for first-pass review.

7.2 Future Work

  1. Implement --resume: Use captured session_id for task continuity within epics.
  2. Model auto-selection: Use task complexity to choose Haiku/Sonnet/Opus.
  3. Cost budgets: Set per-harness-run cost limits with automatic pause.
  4. Streaming output: Use --output-format stream-json for real-time progress.
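
Items 2 and 3 above could start from something as simple as the following sketch. It is not the harness's implementation: the TaskKind categories, selectModel, and shouldPause are hypothetical, and the mapping only restates the use cases from Section 5.4.

```typescript
type Model = 'haiku' | 'sonnet' | 'opus';

// Hypothetical task categories derived from the use cases in Section 5.4.
type TaskKind = 'review' | 'simple-css' | 'standard' | 'architectural';

// Picks the cheapest model the paper considers adequate for each kind of task.
function selectModel(kind: TaskKind): Model {
  switch (kind) {
    case 'review':
    case 'simple-css':
      return 'haiku';
    case 'standard':
      return 'sonnet';
    case 'architectural':
      return 'opus';
  }
}

// Returns true when starting the next session would exceed the run's budget.
function shouldPause(spentUsd: number, nextEstimateUsd: number, budgetUsd: number): boolean {
  return spentUsd + nextEstimateUsd > budgetUsd;
}
```

A real classifier would need a signal for task complexity (file count, spec labels, or similar); choosing that signal is the open part of the future work.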

8. How to Apply This

Migrating Your Own Harness or Autonomous Agent

To apply this migration pattern to your autonomous Claude Code orchestration:

Step 1: Audit Current Tool Access (Human)
List all tools your harness currently uses. Check headless session logs or code
that spawns Claude. If using --dangerously-skip-permissions, you have unlimited
access by default.

Step 2: Categorize Essential Tools (Human)
Group tools by purpose:
- Core file operations (Read, Write, Edit, Glob, Grep)
- Version control (Bash(git:*))
- Package management (Bash(pnpm:*), Bash(npm:*))
- Build/deploy (Bash(wrangler:*), domain-specific commands)
- MCP integrations (mcp__cloudflare__*, mcp__airtable__*, etc.)

Step 3: Create Explicit Allowlist (Human)
Build your ALLOWED_TOOLS string. Start conservative—add only what you've verified
is needed. You can expand later if sessions fail.

Step 4: Add Runaway Prevention (Human + Agent)
Set --max-turns based on observed session lengths. Use 2-3x your average as a
safety margin. If most sessions complete in 30-50 turns, set --max-turns 100.

Step 5: Enable Cost Tracking (Agent)
Add --output-format json to capture session metadata. Parse stdout to extract
costUsd, numTurns, sessionId. Store these for analysis.

Step 6: Test on Non-Critical Work First (Human)
Run the migrated harness on low-stakes tasks. Verify tools aren't blocked.
Check that turn limits don't prevent legitimate completion.
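
The Step 4 heuristic (2-3x the average observed session length) can be sketched as a small helper. suggestMaxTurns is a hypothetical name, and the default multiplier of 2.5 is an assumption chosen from the middle of that range.

```typescript
// Suggests a --max-turns value from observed session lengths.
// The limit is never set at or below the longest session already seen.
function suggestMaxTurns(observedTurns: number[], multiplier = 2.5): number {
  if (observedTurns.length === 0) {
    throw new Error('Need at least one observed session to suggest a limit');
  }
  const total = observedTurns.reduce((sum, turns) => sum + turns, 0);
  const average = total / observedTurns.length;
  const longest = Math.max(...observedTurns);
  return Math.max(Math.ceil(average * multiplier), longest + 1);
}

// Sessions of 30, 40, and 50 turns average 40; 40 * 2.5 = 100.
const suggestedLimit = suggestMaxTurns([30, 40, 50]);
```

The floor at "longest observed + 1" guards against a skewed sample where one legitimate long session would otherwise be cut off by an average-based limit.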

Real-World Example: Migrating a Deployment Harness

Let's say you have a harness that autonomously deploys Cloudflare Workers. Before migration:

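// Before: Unsafe pattern
import { spawnSync } from 'node:child_process';

const args = [
  '-p',
  '--dangerously-skip-permissions',
];

// spawnSync accepts `input` for stdin; the async `spawn` does not
const result = spawnSync('claude', args, { input: deployPrompt, encoding: 'utf8' });
// No cost tracking, no runaway prevention, unrestricted tool access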

After analyzing actual usage, you discover the harness needs:

  • File operations to read wrangler.toml and Worker scripts
  • Git to check status and create deployment tags
  • Wrangler to deploy and check deployment status
  • Cloudflare MCP to update KV/D1 data if needed

After migration:

// After: Agent SDK best practices
import { spawnSync } from 'node:child_process';

const DEPLOY_ALLOWED_TOOLS = [
  // Core file operations
  'Read', 'Write', 'Edit', 'Glob', 'Grep',

  // Version control (scoped to specific subcommands via prefix match)
  'Bash(git status:*)', 'Bash(git tag:*)', 'Bash(git log:*)',

  // Deployment (scoped)
  'Bash(wrangler deploy:*)', 'Bash(wrangler tail:*)', 'Bash(wrangler whoami:*)',

  // Cloudflare MCP (explicit)
  'mcp__cloudflare__worker_deploy',
  'mcp__cloudflare__kv_put',
  'mcp__cloudflare__d1_query',
].join(',');

const args = [
  '-p',
  '--allowedTools', DEPLOY_ALLOWED_TOOLS,
  '--max-turns', '50',  // Deployments are fast; low limit appropriate
  '--output-format', 'json',
  '--model', 'sonnet',  // Sonnet sufficient for deployments
];

// spawnSync accepts `input` for stdin and returns stdout as a string
const result = spawnSync('claude', args, { input: deployPrompt, encoding: 'utf8' });

// Parse metrics (the CLI reports snake_case fields; verify against your version)
const metrics = JSON.parse(result.stdout);
console.log(`Deployment cost: $${metrics.total_cost_usd}`);
console.log(`Turns used: ${metrics.num_turns}/50`);

// Store for analysis (db is a placeholder for your own persistence layer)
await db.deployments.create({
  sessionId: metrics.session_id,
  costUsd: metrics.total_cost_usd,
  numTurns: metrics.num_turns,
  timestamp: new Date(),
});

Notice:

  • Scoped Bash patterns: git status allowed, git reset --hard blocked
  • Lower turn limit: Deployments complete in 10-20 turns; 50 provides headroom
  • Model selection: Sonnet is 5x cheaper than Opus, sufficient for standard deploys
  • Metrics capture: JSON output enables cost analysis over time

When to Expand Tool Access

Add tools to the allowlist when:

  • Sessions fail with "permission denied": Check logs, identify blocked tool, evaluate if it should be allowed
  • New workflow requirements: Adding database migrations? Add mcp__cloudflare__d1_query
  • Peer review identifies missing capability: Architecture reviewer notes the harness can't perform needed operation

Don't add tools when:

  • The request is "just in case"—only add verified needs
  • A safer alternative exists (prefer WebFetch over Bash(curl:*))
  • The operation should require human approval (don't automate destructive operations)

Validating the Migration

After migration, validate success by:

✓ Zero sessions blocked by missing tool permissions
✓ All sessions complete within turn limits (or fail for legitimate reasons)
✓ Cost tracking data populates correctly
✓ Model selection matches task complexity (Haiku for simple, Opus for complex)
✓ Peer reviews run and surface appropriate findings
✓ No degradation in harness capabilities compared to legacy approach

The goal is explicit security without operational cost. If the migration blocks legitimate work or significantly slows execution, the allowlist is too restrictive. If it allows operations that shouldn't be automated, it's too permissive. Iterate until the harness operates transparently—Zuhandenheit achieved.
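
Part of this checklist can be automated over stored session records. The sketch below covers the first three items; SessionRecord, its deniedTools field, and validateMigration are hypothetical, and how denied tool requests are detected depends on what your harness logs.

```typescript
// Hypothetical stored record for one harness session.
interface SessionRecord {
  issueId: string;
  numTurns: number | null;
  costUsd: number | null;
  deniedTools: string[]; // tools the session requested but the allowlist blocked
}

// Returns a list of human-readable problems; an empty list means the checks passed.
function validateMigration(records: SessionRecord[], maxTurns: number): string[] {
  const problems: string[] = [];
  for (const r of records) {
    if (r.deniedTools.length > 0) {
      problems.push(`${r.issueId}: blocked tools ${r.deniedTools.join(', ')}`);
    }
    if (r.numTurns !== null && r.numTurns >= maxTurns) {
      problems.push(`${r.issueId}: reached the turn limit (${maxTurns})`);
    }
    if (r.costUsd === null || r.numTurns === null) {
      problems.push(`${r.issueId}: missing cost or turn metrics`);
    }
  }
  return problems;
}
```

The remaining checklist items (model fit, review quality, capability parity) still need human judgment at checkpoint review.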

9. Conclusion

The Agent SDK migration improves the CREATE Something Harness without degrading operational capability. The explicit tool allowlist provides defense-in-depth security, while --max-turns prevents runaway sessions.

The key insight: restrictive defaults with explicit exceptions are more maintainable than permissive defaults with implicit risks.

This aligns with the Subtractive Triad:

  • DRY: One allowlist, not per-session permission decisions
  • Rams: Only necessary tools; each earns its place
  • Heidegger: Infrastructure recedes; security becomes invisible when correct

Appendix A: Full Tool Allowlist

const HARNESS_ALLOWED_TOOLS = [
  // Core file operations
  'Read', 'Write', 'Edit', 'Glob', 'Grep', 'NotebookEdit',

  // Bash with granular patterns
  'Bash(git:*)', 'Bash(pnpm:*)', 'Bash(npm:*)', 'Bash(npx:*)',
  'Bash(node:*)', 'Bash(tsc:*)', 'Bash(wrangler:*)',
  'Bash(bd:*)', 'Bash(bv:*)',  // Beads CLI
  'Bash(grep:*)', 'Bash(find:*)', 'Bash(ls:*)', 'Bash(cat:*)',
  'Bash(mkdir:*)', 'Bash(rm:*)', 'Bash(cp:*)', 'Bash(mv:*)',
  'Bash(echo:*)', 'Bash(test:*)',

  // Orchestration
  'Task', 'TodoWrite', 'WebFetch', 'WebSearch',

  // CREATE Something
  'Skill',

  // MCP Cloudflare
  'mcp__cloudflare__kv_get', 'mcp__cloudflare__kv_put',
  'mcp__cloudflare__kv_list', 'mcp__cloudflare__d1_query',
  'mcp__cloudflare__d1_list_databases',
  'mcp__cloudflare__r2_list_objects', 'mcp__cloudflare__r2_get_object',
  'mcp__cloudflare__r2_put_object', 'mcp__cloudflare__worker_list',
  'mcp__cloudflare__worker_get', 'mcp__cloudflare__worker_deploy',
].join(',');

"The harness recedes into transparent operation. Review progress. Redirect when needed."