If Your Harness Is Getting More Complex, Something Is Wrong
Source: Himanshu

Himanshu, writing on X in a well-researched survey of how Claude Code, Cursor, Manus, and others actually build their AI agents:
"The model is interchangeable. The harness is the product."
The evidence he marshals is compelling. Claude Opus 4.5 scores 42% on one benchmark with one scaffold and 78% with another. Vercel deleted 80% of their agent's tools and watched it go from failing to succeeding. Cursor's lazy tool loading cuts token usage by nearly half. Same models, different wrappers, wildly different outcomes.
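Cursor hasn't published the mechanism behind that token reduction, but the general idea of lazy tool loading is simple to sketch: show the model cheap one-line summaries of every tool, and pay for a full schema only when a tool is actually invoked. The class and token figures below are illustrative assumptions, not Cursor's implementation:

```python
# Hypothetical sketch of lazy tool loading: the model always sees short
# summaries; a tool's full schema enters the context only on first use.

FULL_SCHEMAS = {  # imagine dozens of these, each hundreds of tokens
    "read_file": {"description": "Read a file from disk",
                  "parameters": {"path": {"type": "string"}}},
    "run_tests": {"description": "Run the project's test suite",
                  "parameters": {"filter": {"type": "string"}}},
}

class LazyToolbox:
    def __init__(self, schemas):
        self._schemas = schemas
        self._loaded = set()

    def summaries(self):
        # Cheap index the model always sees: name plus one-line description.
        return {name: s["description"] for name, s in self._schemas.items()}

    def load(self, name):
        # Full schema is injected into context only when the tool is used.
        self._loaded.add(name)
        return self._schemas[name]

    def context_tokens(self, per_schema=300, per_summary=15):
        # Rough illustration of the savings (token costs are made up):
        # summaries for everything, full schemas only for tools touched.
        return (len(self._schemas) * per_summary
                + len(self._loaded) * per_schema)

toolbox = LazyToolbox(FULL_SCHEMAS)
toolbox.load("read_file")        # the agent only ever used one tool
print(toolbox.context_tokens())  # vs. eager loading: every schema, every turn
```

With two tools and one used, that's 330 illustrative tokens against 600 for eager loading; the gap widens with every tool the agent never touches.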
This is the kind of piece the AI discourse needs more of — architecturally specific, well-sourced, and focused on what's actually shipping rather than what's theoretically possible. The progressive disclosure framing is sharp. The detail about Anthropic's TodoWrite tool — a no-op that does nothing except force the model to write down its plan — is the sort of thing that separates people who've read the code from people summarising launch posts.
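The piece describes TodoWrite as a tool with no side effects at all, which makes it strikingly small. A hypothetical reconstruction (the name comes from the source; the shape of the function is my assumption, not Anthropic's actual code) is essentially this:

```python
# Hypothetical reconstruction of a "no-op planning tool": calling it
# changes nothing in the environment. Its entire value is behavioural --
# the act of calling it forces the model to serialise its plan into the
# transcript, where every later turn can see and follow it.

def todo_write(todos: list[str]) -> str:
    # No file, no state, no execution. The "work" is that the plan
    # now exists verbatim in context.
    return "Todos updated:\n" + "\n".join(f"- [ ] {t}" for t in todos)

print(todo_write([
    "reproduce the bug",
    "write a failing test",
    "apply the fix",
    "rerun the suite",
]))
```

The tool result echoed back into context is the whole trick: the model reads its own plan on the next turn and tends to stick to it.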
But the headline claim overshoots the evidence underneath it. If the model were truly interchangeable, swapping models inside the same harness would produce identical scores. It doesn't. The same harness gives you 42% with one model and 62% with another. What the data actually shows is that harness quality and model quality multiply each other. A great harness makes a good model dramatically better. That's a less tweetable thesis, but it's the accurate one.
There's also a tension the piece doesn't resolve. Cursor tunes its harness per model — different tool names, different prompts, different behavioural guidance for each frontier model. Manus has rewritten their framework five times. If the harness is the stable product layer and the model is the commodity input, why does changing the model keep breaking the harness? What you've actually got is co-evolution, not clean abstraction. The harness isn't a car that accepts any engine. It's a car whose transmission has to be rebuilt every time you swap one in.
The survivorship bias is worth naming too. Every company profiled here shipped a working agent. We're reverse-engineering their design choices and calling those choices principles. But for every team that arrived at "simple flat loop," there's presumably a team that tried the same thing and failed in ways that don't get written up in engineering blogs. "Keep it simple" is the conclusion of the winners. It is not necessarily the reason they won.
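For concreteness, the "simple flat loop" the winners converged on is structurally just this: one model call per turn, one dispatch table of tools, no nested planners. This is a generic sketch of the pattern, with the model call stubbed out; it is not any specific company's code:

```python
# Sketch of the flat agent loop pattern. `call_model` stands in for any
# chat-completions API; a real implementation returns either a tool call
# or a final answer.

def call_model(messages):
    # Stub: always finishes immediately. Real code would send `messages`
    # to an LLM and parse its response.
    return {"final": "done"}

def run_agent(task, tools, max_turns=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(messages)
        if "final" in action:
            return action["final"]
        # Execute the requested tool and feed the result straight back
        # into the next turn -- that is the entire control flow.
        result = tools[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": str(result)})
    return "max turns exceeded"

print(run_agent("fix the failing test", tools={}))  # → "done" with the stub
```

Everything the surveyed companies differ on (which tools, which prompts, which model) lives outside this loop; the loop itself stays flat.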
All that said — if you're building agents, read the whole thing. The technique matrix alone is worth your time, and the architectural survey across companies is more thorough than most published accounts.
The line that should have been the headline, buried near the end: "If your agent harness is getting more complex while models get better, something is wrong." That's the real insight. The rest is evidence.