The Fossil Record of Harness Engineering

Mark Esler · 15 min read

Every AI coding tool solves the same fundamental problem: fitting the right information into a fixed-size context window so an LLM can write correct code. Claude Code (v2.1.88, source maps), Aider (v0.86.3, Apache 2.0), Cursor (leaked prompts, v1.0-2.0), Windsurf (leaked prompts, Waves 1-11), and GitHub Copilot (vscode-copilot-chat v0.43.0, MIT) solve it five completely different ways.

Tobi Lütke named the discipline “context engineering” – “the art of providing all the context for the task to be plausibly solvable by the LLM.” Mitchell Hashimoto pushed further to “harness engineering” – “anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again.” Each architecture is a fossil record of the constraint its team built around. Claude Code: cost. Aider: model portability. Cursor: edit speed. Windsurf: autonomy. Copilot: model plurality. What follows is how each team engineered their harness, examined through source code, leaked prompts, and open-source repositories.


1. Prompt Assembly

The system prompt is the first thing the model sees. How it gets built determines everything downstream.

| Dimension | Claude Code | Aider | Cursor | Windsurf | Copilot |
|---|---|---|---|---|---|
| Structure | 20-section compiled pipeline | 8-chunk ChatChunks dataclass | ~17 XML sections | Monolithic blob, XML sections | 7+ per-model variant files |
| Tool definitions | Separate JSON schemas | None | Inline TypeScript types | Inline TypeScript namespace | JSON schemas, separate |
| Build artifact? | Yes (dead-code elimination) | No (runtime template) | No | No | Partial (per-model files) |
| Approx. size | ~12 KB static + dynamic | ~2-8 KB | ~19 KB (pruned from 39 KB) | ~25 KB | ~15-20 KB per variant × 7+ |

Claude Code’s prompt is a build artifact. getSystemPrompt() assembles a string[] through a two-phase pipeline: static sections first (identity, system rules, coding guidelines, actions safety, tool preferences, tone, output efficiency), then dynamic sections after the SYSTEM_PROMPT_DYNAMIC_BOUNDARY marker (session guidance, memory, environment info, MCP instructions, compaction config). The build step evaluates process.env.USER_TYPE === 'ant' at compile time – if false, the bundler strips internal-only code entirely. The public npm package never contains it. The prompt is compiled, not interpreted.
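A minimal Python sketch of the split (the section contents here are invented for illustration; the real pipeline is TypeScript): everything that can vary per session or per turn goes after the boundary marker, so the byte-identical static prefix keeps hitting the provider cache.

```python
# Illustrative two-phase prompt assembler. Section strings are
# hypothetical; only the static/dynamic boundary idea is from the article.

STATIC_SECTIONS = [
    "You are a coding agent...",          # identity
    "Never run destructive commands...",  # actions safety
]
DYNAMIC_BOUNDARY = "<!-- DYNAMIC_BOUNDARY -->"

def build_prompt(session_memory: str, env_info: str) -> str:
    sections = list(STATIC_SECTIONS)   # cacheable prefix, identical every turn
    sections.append(DYNAMIC_BOUNDARY)  # cache-eligible region ends here
    sections.append(session_memory)    # changes per session
    sections.append(env_info)          # changes per machine
    return "\n\n".join(sections)
```

The property the real pipeline engineers for: two requests with different dynamic tails still share an identical prefix up to the boundary.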

Aider’s prompt is the simplest. format_chat_chunks() returns a ChatChunks dataclass with eight ordered segments: system, examples, readonly_files, repo, done, chat_files, cur, reminder. Each coder type overrides the *_prompts.py fields. Dynamic elements are Python format strings substituted at runtime. No compilation, no build-time elimination, no markup. The prompt is a Python string.
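The ordering is the whole design, so it can be sketched directly (field names follow the article; this is a simplified stand-in, not aider’s actual class):

```python
# Sketch of the eight-segment ordering behind aider's ChatChunks.
# Field names match the article; everything else is simplified.
from dataclasses import dataclass, field

@dataclass
class ChatChunks:
    system: list = field(default_factory=list)
    examples: list = field(default_factory=list)
    readonly_files: list = field(default_factory=list)
    repo: list = field(default_factory=list)
    done: list = field(default_factory=list)
    chat_files: list = field(default_factory=list)
    cur: list = field(default_factory=list)
    reminder: list = field(default_factory=list)

    def all_messages(self) -> list:
        # concatenation order is the prompt order
        return (self.system + self.examples + self.readonly_files
                + self.repo + self.done + self.chat_files
                + self.cur + self.reminder)
```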

Cursor’s prompt went through a dramatic trajectory: v1.0 (9 KB) grew to v2.0 (39 KB, a 4x expansion as features were bolted on), then shrank in a later revision (19 KB, a 2x reduction as stronger models needed less hand-holding). The <non_compliance> section is unique – it instructs the model to detect and correct its own protocol violations mid-response.

Windsurf ships everything in one blob. Every version follows the same pattern: identity preamble, XML-tagged behavioral sections, tool definitions as TypeScript namespace. The <EPHEMERAL_MESSAGE> mechanism silently injects mid-conversation directives from the IDE that the user never sees. Wave 11 added model identity injection: “if asked about what your underlying model is, respond with GPT 4.1” – regardless of which model is actually serving the request. No other tool examined here hardcodes a model identity response.

Copilot is unique in maintaining distinct system prompt variants per model. The open-source extension reveals 16 registered prompt resolvers via PromptRegistry – far more than the 7 leaked prompt files suggested. GPT-5 through GPT-5.4 each have separate resolvers. Several model families are identified only by SHA256 hash of their family string, obscuring unreleased or partner models. The prompt assembly itself uses Microsoft’s prompt-tsx framework – a JSX component tree where each section declares its priority and how much space it can grow into, and a renderer that allocates tokens the way CSS flex-grow allocates pixels. This is declarative prompt assembly with automatic budget management – fundamentally different from Claude Code’s imperative function chain or Aider’s Python template strings. Nobody else does this.


2. Caching Strategy

Prompt caching is where harness engineering becomes economics. Anthropic’s API caches the static prefix of each request – subsequent turns that match the prefix pay 10% of input cost. Any change to the prefix (model switch, new tool, flipped feature flag) busts the cache and charges full price. Every API user benefits from caching, but Claude Code’s architecture is designed to maximize hits: the static/dynamic prompt boundary, the system-reminder tags, the 14-dimension monitoring all exist to keep the prefix stable. Anthropic engineer Thariq Shihipar (February 2026) estimated that a long Claude Opus session at 100 turns costs Anthropic $50-100 without caching and $10-19 with.

| Dimension | Claude Code | Aider | Cursor | Windsurf | Copilot |
|---|---|---|---|---|---|
| Default state | ON (architecture depends on it) | OFF | Implicit (provider-native) | None documented | Annotations in prompt |
| Boundary engineering | SYSTEM_PROMPT_DYNAMIC_BOUNDARY, cacheScope annotations | 3 optional breakpoints | Relies on provider | None | copilot_cache_control markers |
| Monitoring | 14-dimension tracking, SEV on >5% miss increase | None | None | None | None |
| Economic impact | $50-100 → $10-19 per long session | Optional savings | Unknown | Unknown | Unknown |

Claude Code’s cache architecture is documented in promptCacheBreakDetection.ts. The system tracks 14 dimensions between API calls. It detects breaks by monitoring drops greater than 5% in cache_read_tokens. Two section types exist: systemPromptSection (computed once, cached) and DANGEROUS_uncachedSystemPromptSection (recomputes every turn, requires a justification string explaining why). The DANGEROUS_ prefix means expensive, not insecure. Anthropic declares SEVs when cache hit rates drop. Stale information is sent as <system-reminder> tags in user messages rather than updating the system prompt, specifically to avoid breaking the cache.
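The break-detection idea reduces to a small check (a simplification, assuming a single tracked dimension rather than all 14):

```python
# Simplified cache-break detector: flag a suspected break when
# cache_read_tokens drops by more than 5% between consecutive API calls.
# The real tracker in promptCacheBreakDetection.ts monitors 14 dimensions.

def cache_break_suspected(prev_read: int, curr_read: int,
                          threshold: float = 0.05) -> bool:
    """True when the relative drop in cache-read tokens exceeds threshold."""
    if prev_read == 0:
        return False  # nothing was cached; no break to detect
    drop = (prev_read - curr_read) / prev_read
    return drop > threshold
```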

Aider’s caching is opt-in via --cache-prompts. Three breakpoints when enabled. Keepalive pings every 4:55 (5-minute cache TTL). No monitoring, no break detection. The cache saves money when it works and silently costs full price when it doesn’t.
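The keepalive arithmetic, sketched with hypothetical helper names (the 4:55 interval sits just inside the 5-minute TTL):

```python
# Hypothetical helper illustrating the keepalive math: pinging every
# 4:55 keeps a 5-minute prompt cache warm with a small safety margin.
CACHE_TTL_S = 5 * 60
PING_INTERVAL_S = CACHE_TTL_S - 5  # 4:55

def pings_needed(idle_seconds: int) -> int:
    """How many keepalive pings cover an idle period (ceiling division)."""
    return max(0, -(-idle_seconds // PING_INTERVAL_S))
```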

Copilot’s source reveals a more sophisticated cache strategy than the leaked prompts showed. A 4-slot cache breakpoint allocation algorithm walks messages in reverse, placing breakpoints on tool results and user messages. A separate Anthropic-specific layer adds cache_control: ephemeral to the last tool definition and last system block. OpenAI’s Responses API uses prompt_cache_key = "{conversationId}:{model.family}" for cross-turn hits. This is provider-polymorphic caching – different strategies per backend – but still no monitoring infrastructure comparable to Claude Code’s 14-dimension tracking.
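The reverse-walk allocation can be sketched like this (a simplification of the behavior described above, with invented message shapes):

```python
# Simplified 4-slot breakpoint allocator: walk the conversation in
# reverse and place breakpoints on tool results and user messages,
# mirroring the strategy described for Copilot's Anthropic backend.

def allocate_breakpoints(messages: list, slots: int = 4) -> list:
    """Return indices of messages that would get a cache_control marker."""
    chosen = []
    for i in range(len(messages) - 1, -1, -1):
        if len(chosen) == slots:
            break
        if messages[i]["role"] in ("tool", "user"):
            chosen.append(i)
    return sorted(chosen)
```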

Cursor and Windsurf show no evidence of cache boundary engineering.

Only Claude Code treats caching as architecture. Everyone else treats it as a feature toggle or ignores it entirely.


3. Codebase Awareness

Three fundamentally different approaches: no index, structural index, and semantic index.

| Dimension | Claude Code | Aider | Cursor | Windsurf | Copilot |
|---|---|---|---|---|---|
| Approach | On-demand tools (Grep/Glob/Read) | PageRank repo map (tree-sitter) | Merkle tree + server embeddings | M-Query RAG (proprietary) | Semantic search tool |
| Token cost/turn | Variable (0 when not exploring) | 1024-8192 always | 0 (on invocation) | 0 (on invocation) | 0 (on invocation) |
| Code leaves machine | No | No | Yes | Yes | Yes |
| Index without network | Yes | Yes | No (server-side) | No (server-side) | No |

Claude Code navigates blind. The agent has no precomputed map. It uses Grep, Glob, and Read as a flashlight, building understanding incrementally through tool calls. Simplest, most flexible, most private. The cost is tokens for exploration.

Aider’s repo map is architecturally distinct. repomap.py parses every file with tree-sitter, builds a directed graph, runs PageRank with heavy weighting toward files in the current chat (50x), mentioned identifiers (10x), and downweighting private names (0.1x), then trims the map to fit the token budget. The model always knows the codebase topology without tool calls. The tradeoff: 1024-8192 tokens per turn regardless of whether the model needs codebase awareness, versus Claude Code’s on-demand exploration that costs nothing when the model already knows where it is. For sessions that involve heavy navigation, Aider’s approach is more token-efficient; for sessions focused on a known file, Claude Code’s is.
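A toy version of the weighting idea (a sketch, not repomap.py itself – the real implementation builds its graph from tree-sitter tags and uses a proper PageRank library): the 50x/10x/0.1x boosts enter as edge-weight multipliers before ranking.

```python
# Toy weighted PageRank via power iteration. Dangling nodes leak rank
# mass, which is acceptable for this illustration.

def pagerank(edges, weights, nodes, damping=0.85, iters=50):
    """edges: list of (src, dst) references; weights: parallel multipliers."""
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: 0.0 for n in nodes}
    for (s, _), w in zip(edges, weights):
        out_weight[s] += w
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(nodes) for n in nodes}
        for (s, d), w in zip(edges, weights):
            if out_weight[s]:
                # each edge hands over rank in proportion to its weight
                nxt[d] += damping * rank[s] * w / out_weight[s]
        rank = nxt
    return rank
```

A file referenced with a chat-file boost (50x) ends up ranked far above one referenced only through a downweighted private name (0.1x), so it survives the trim to the token budget.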

Cursor’s Merkle tree sync is the heaviest infrastructure. Client-side hash tree, server-side embedding, SimHash for team index reuse. Best recall for semantic queries. Your codebase is transmitted to and indexed on Cursor’s AWS infrastructure, re-embedded and persistently stored.

Windsurf uses M-Query, a proprietary RAG system undocumented beyond marketing claims, over a local protobuf index (~40 MB).

Copilot uses semantic_search for embedding-based retrieval. For the cloud agent, code already lives on GitHub – but the cloud agent runs in an ephemeral VM with network access, which is a different trust boundary than a git remote. The VM can execute arbitrary tool calls against a clone of the repo, and the firewalling/branch restrictions are the only guardrails.


4. Edit Mechanism

Each tool translates model intent into file changes differently.

| Tool | Model writes | Harness finds the target by | When it fails |
|---|---|---|---|
| Claude Code | Old string + new string | Exact text match in file | String not found in file |
| Aider | SEARCH/REPLACE text blocks | Exact match, then fuzzy (5 levels) | Wrong match from fuzzy approximation |
| Cursor | Sketch with // ... existing code ... gaps | A second model merges sketch against original | Merge model misreads the intent |
| Windsurf | Old chunk + new chunk | Line ranges in the file | Wrong lines matched |
| Copilot | Unified diff with class/function names | Class and function names, not line numbers | Named context not found |

The spectrum runs from exact match (Claude Code) through fuzzy match (Aider) and line-range addressing (Windsurf) to semantic sketch (Cursor) and semantic addressing (Copilot).

Cursor’s two-phase edit is the most architecturally novel. The primary model writes a sketch with // ... existing code ... placeholders. A fine-tuned 70B Llama on Fireworks takes the sketch plus the original file and produces the merged output at 1000+ tokens/second via speculative decoding. Neither Claude Code nor Aider does anything like this.
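A deterministic toy makes the merge step concrete (nothing like the real fine-tuned apply model, which handles far fuzzier sketches): expand each placeholder by copying the elided original lines up to the next anchor line.

```python
# Toy placeholder expansion. Assumes non-placeholder sketch lines appear
# verbatim and in order in the original file, which the real apply model
# does not require.

PLACEHOLDER = "// ... existing code ..."

def merge(original: str, sketch: str) -> str:
    orig = original.splitlines()
    lines = sketch.splitlines()
    out, pos = [], 0
    for i, line in enumerate(lines):
        if line.strip() == PLACEHOLDER:
            # find the next anchor line after the placeholder
            nxt = next((l for l in lines[i + 1:]
                        if l.strip() != PLACEHOLDER), None)
            # copy elided original lines up to that anchor (or EOF)
            stop = orig.index(nxt, pos) if nxt in orig[pos:] else len(orig)
            out.extend(orig[pos:stop])
            pos = stop
        else:
            out.append(line)
            if line in orig[pos:]:
                pos = orig.index(line, pos) + 1  # advance past the anchor
    return "\n".join(out)
```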

Copilot’s apply_patch uses class/function names instead of line numbers:

*** Begin Patch
*** Update File: src/auth.py
@@@ class AuthManager
@@@ def validate_token
-        if token.expired:
-            return False
+        if token.expired:
+            logger.warning("Token expired")
+            return False
*** End Patch

Models are better at remembering “this is in validate_token of AuthManager” than “this starts at line 47.” Semantic addressing survives line-number shifts from concurrent edits.
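A sketch of why this works (hypothetical helper, not Copilot’s parser): resolve the class/function pair by scanning for names, so lines added above the target shift the answer instead of breaking it.

```python
# Hypothetical semantic-address resolver: find a function inside a class
# by name rather than by a stored line number.
import re

def locate(source: str, class_name: str, func_name: str) -> int:
    """Return the line index of func_name inside class_name, or -1."""
    in_class = False
    for i, line in enumerate(source.splitlines()):
        if re.match(rf"\s*class\s+{class_name}\b", line):
            in_class = True
        elif in_class and re.match(rf"\s*def\s+{func_name}\b", line):
            return i
    return -1
```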

Copilot also maintains per-model edit formats – apply_patch for GPT-4.1/GPT-5, replace_string_in_file for GPT-4o/Gemini, multi_replace_string_in_file for Claude – because different models handle different formats with different reliability. The source reveals cross-pollination between competitors: the apply_patch parser is copyright OpenAI (Apache 2.0, from their cookbook), while the edit healing system is copyright Google LLC (Apache 2.0, adapted from Gemini CLI’s editCorrector.ts). Three competitors’ code in one Microsoft codebase. For unknown bring-your-own-key models, an EditToolLearningService tracks success/failure per edit tool and dynamically selects the best-performing format – a machine learning approach to edit format selection that no other tool has.


5. Memory and Persistence

In these tools, visibility into memory mechanisms tracks with resistance to poisoning.

| Dimension | Claude Code | Aider | Cursor | Windsurf | Copilot |
|---|---|---|---|---|---|
| Mechanism | 3-layer files | None | .cursorrules only | create_memory (auto-fires) | 3-scope virtual FS (user/session/repo) + instructions file |
| Version-controlled | Yes | N/A | Yes | No (hidden dir) | Yes |
| Approval required | Yes (file changes) | N/A | N/A | No | Varies |
| Poisoning risk | Low (file changes require approval) | None (no memory) | Moderate (repo file, no approval) | Critical (SpAIware) | Low (local); cloud agent has different trust boundary |

Claude Code’s memory is files on disk. CLAUDE.md in the repo, reviewed in PRs. MEMORY.md auto-persisted with frontmatter entries. autoDream runs background consolidation modeled on REM sleep (orient, gather, consolidate, prune). Everything is plaintext, version-controllable, inspectable.

Aider has no memory. Every session starts fresh. A feature, not a bug, for its use case.

Windsurf’s memory has the worst security posture of the five. create_memory auto-fires without approval. Memories are stored at ~/.codeium/windsurf/memories/ but never surfaced in the UI. The prompt says: “You DO NOT need USER permission to create a memory.” This is the root cause of SpAIware: a single prompt injection permanently compromises all future sessions.

The pattern: visible file-based memory (Claude Code, Copilot) is auditable and resistant to poisoning. Hidden database-style memory (Windsurf) is convenient but creates a context gap the user cannot inspect – the model’s behavior is shaped by memories the user never sees and cannot correct.


6. Orchestration

| Dimension | Claude Code | Aider | Cursor | Windsurf | Copilot |
|---|---|---|---|---|---|
| Agent types | 7 | 1 | 1 | 2 (planner + executor) | 4 CLI modes + cloud agent |
| Parallel agents | Fork (cache-shared) + teammates (process-level) | None | Parallel tool calls | Parallel tool calls | Fleet (process-level) |
| Tool restriction | Primary behavioral control | N/A (no tools) | None | None | Mode-dependent |
| Planning | Extended thinking | Architect mode (two-model) | Single agent | Background planner | Plan mode (user-approved) |
Figure 1: Claude Code subsystem coupling. Three subsystems (Cache, Compaction, Agent) connected through the conversation driver loop and API client. The mutual recursion between query.ts and runAgent is how agents achieve multi-turn capability. Both Compaction and Agent converge on forkedAgent for cache-safe API calls. Blue dashed edges show cache feedback notifications after compaction events.

Claude Code has the most sophisticated agent framework. Seven types with tool restriction as the primary behavioral control. Fork subagents share the parent’s cache prefix – each additional fork pays only cache-read cost (10% of input) on the shared prefix, so five parallel forks cost roughly 1.4x one sequential agent, not 5x. Teammates are heavier – separate processes that coordinate by writing files to disk.
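The arithmetic, as a sanity check (assuming the shared cached prefix dominates the token count and cache reads cost 10% of full input price):

```python
# Back-of-envelope fork economics: the first agent pays full price to
# write the shared prefix; each additional fork re-reads it at the
# cached rate. Assumes prefix tokens dominate the cost.

def relative_cost(n_forks: int, cache_read_price: float = 0.1) -> float:
    """Cost of n cache-sharing forks relative to one sequential agent."""
    return 1.0 + (n_forks - 1) * cache_read_price
```

Five forks come out to 1.4x one sequential agent, matching the figure above, versus 5x if each fork re-sent the prefix at full price.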

Aider’s ceiling is architect mode: main model plans, editor model executes. Clean, simple, no subagents.

Windsurf’s dual-agent planning (background planner + execution agent) is the most autonomous. Combined with 29 tools and the agentic loop, Cascade runs for extended periods without human checkpoints.

Copilot’s Fleet mode is the closest to a distributed agent system. Each subagent is a separate process with its own model context. The cloud agent adds an ephemeral VM with network firewalling and branch restrictions.


7. Compaction

| Dimension | Claude Code | Aider | Cursor | Windsurf | Copilot |
|---|---|---|---|---|---|
| Strategies | 5 + autoDream | 1 (recursive head summarization) | Not documented | Not documented | 5 parallel systems |
| Trigger | Multiple thresholds | Token budget exceeded | Unknown | Unknown | Dual-threshold (80% background, 95% blocking) |
| Instruction reinforcement | Cached static prefix | system_reminder repetition | <non_compliance> self-correction | <EPHEMERAL_MESSAGE> injection | <reminderInstructions> tail |

Claude Code has five compaction strategies plus autoDream. A production bug illustrates the stakes: a comment in autoCompact.ts documents that 1,279 sessions experienced 50+ consecutive compaction failures, wasting 250,000 API calls/day. Fix: MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3.
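The fix amounts to a retry cap (a hypothetical wrapper around the constant; the real logic lives in autoCompact.ts):

```python
# Hypothetical retry wrapper: stop auto-compaction after a fixed number
# of consecutive failures instead of retrying forever.
MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES = 3

def run_autocompact(compact_once,
                    max_failures=MAX_CONSECUTIVE_AUTOCOMPACT_FAILURES):
    """compact_once() returns True on success; give up after the cap."""
    failures = 0
    while failures < max_failures:
        if compact_once():
            return True
        failures += 1
    return False  # surface the failure rather than burn API calls
```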

Copilot routes to multiple model providers (GPT, Claude, Gemini, Grok), and each provider compacts differently. The result is five parallel compaction systems – the most complex compaction story of the five. A client-side background summarizer triggers at 80% context usage (non-blocking) and blocks at 95%. On the Anthropic backend: context editing clears thinking blocks and old tool uses server-side at 100K tokens. On the OpenAI backend: the Responses API handles compaction via encrypted content blobs at 90% threshold. Manual /compact and prompt-tsx budget truncation round out the set. The key architectural difference: Copilot delegates compaction to each provider’s native mechanism, while Claude Code owns it entirely. Post-compaction, Copilot saves the full pre-compaction transcript to disk and injects a hint: “use read_file to look up the full uncompacted conversation at {path}.” This transcript preservation is unique among the five.
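The client-side dual threshold reduces to a small dispatcher (a sketch with invented names, keeping only the 80%/95% logic):

```python
# Sketch of a dual-threshold compaction trigger: 80% context usage
# starts a non-blocking background summarizer, 95% blocks the turn.

def compaction_action(used_tokens: int, window: int) -> str:
    usage = used_tokens / window
    if usage >= 0.95:
        return "block_and_compact"      # cannot safely continue
    if usage >= 0.80:
        return "background_summarize"   # compact without interrupting
    return "none"
```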

The instruction reinforcement problem is universal: as context grows, behavioral adherence degrades. Morph/Chroma research tested 18 frontier models and found every one gets worse as input length increases. Each tool addresses this differently. Claude Code puts instructions in the cached static prefix (maximum attention weight). Copilot wraps rules in <reminderInstructions> at the end (primacy-recency effect). Aider repeats system_reminder. Cursor uses self-correction. Windsurf injects mid-conversation directives.


8. Opacity

Two mechanisms deserve mention. Windsurf’s <EPHEMERAL_MESSAGE> injects IDE directives into the conversation that the user never sees in the chat UI – the model receives instructions the user cannot review. Copilot’s SHA256-hashed model family identifiers obscure which models are available or in use, making it impossible for users to audit what model processes their code without reverse-engineering the hash. Both are design choices that reduce user visibility into the system’s behavior. The security consequences of reduced visibility are measurable: AIShellJack found prompt injection success rates of 83.4% in Cursor’s auto mode versus 41-52% in Copilot, with tool-specific attack surfaces shaped by exactly these opacity decisions.


9. What Converges, What Diverges

Convergences

XML-ish prompt sections. Tool-based interaction. Agentic loops (“keep going until done”). Per-file context injection. Project-level instruction files. Instruction repetition near the end of context.

Divergences

| Dimension | The Spectrum |
|---|---|
| Caching | Architecture (Claude Code) … feature toggle (Aider) … absent (Windsurf) |
| Codebase awareness | No index (Claude Code) … structural (Aider) … semantic (Cursor, Windsurf, Copilot) |
| Code edit mechanism | Exact match (Claude Code) … fuzzy (Aider) … sketch (Cursor) … semantic (Copilot) |
| Memory | Visible files (Claude Code) … hidden database (Windsurf) … nothing (Aider) |
| Orchestration | 7 agent types (Claude Code) … 5 surfaces (Copilot) … dual-agent (Windsurf) … single (Cursor) |
| Prompt variants | Per-model (Copilot) … per-coder-type (Aider) … single (Claude Code, Cursor, Windsurf) |

10. The Fossil Record

Each architecture is a fossil record of the constraint its team built around.

Claude Code built around cost. Anthropic runs the model and pays per-token. The entire architecture – static/dynamic boundary, global cache scope, cache-break monitoring, SEV alerting, fork subagent cache sharing, <system-reminder> tags to avoid prompt mutations – exists to minimize per-session cost. Every other design decision is downstream of cache economics.

Aider built around model portability. The tool supports any model through any provider via LiteLLM. Function calling was tried then deliberately abandoned. Edits are text-prompted and regex-parsed, requiring 14 coder types. The three-model architecture (main, weak, editor) exists because cheap models can handle summarization. The entire architecture serves model portability.

Cursor built around edit speed. The two-phase sketch-plus-apply-model pattern exists because speculative decoding at 1000+ tokens/second on a fine-tuned 70B Llama makes edits feel instant. The Shadow Workspace (a hidden Electron window with gRPC/protobuf IPC providing lint feedback before presenting changes) addresses the second-order problem of fast edits being wrong edits. Speed created the architecture; correctness created the Shadow Workspace.

Windsurf built around autonomy. The trajectory from Dec 2024 (11 tools, conservative pair programmer) through Wave 11 (29 tools, dual-agent planning, browser automation, deployment, persistent memory) is unidirectional: more capability, more tools, less human-in-the-loop. create_memory fires without approval because autonomous agents need persistence. read_url_content fires without approval because autonomous agents need web access. The same design philosophy that makes Windsurf the most capable tool makes it the most vulnerable.

Copilot built around model plurality. GitHub supports GPT-4.1, GPT-5 through GPT-5.4, Claude Sonnet 4, Gemini, Grok, plus SHA256-hashed unreleased partners. Each model responds differently to prompting strategies. So Copilot maintains 16 prompt resolvers, 4 edit strategies with per-model selection, provider-specific compaction (Anthropic context editing, OpenAI Responses API, client-side summarization), and an edit tool learning system that dynamically discovers what works for unknown bring-your-own-key models. The apply_patch parser is OpenAI’s code. The edit healing is Google’s code. Three competitors’ IP in one Microsoft codebase — the architecture is a treaty as much as it is engineering. The five execution surfaces exist because GitHub serves enterprises that need VM isolation AND individual developers who want IDE convenience. Architecture follows the customer base.


The harness is the fossil record. The problem is the same for everyone: build the system around the model so it can do useful work, and engineer it so mistakes don’t recur. The architecture reveals which constraint dominated.