@mfpiccolo: Kaffu's "rich man's toy" line is one of the sharpest things I've read on harnesses this year. He's right about the symp…
Summary
The tweet discusses the problem of bloat in AI agent harnesses, agreeing with Kaffu's critique that harnesses become "rich man's toys," and advocates for a composable architecture of small, replaceable workers to reduce drift and keep systems cheap and debuggable.
Cached at: 06/02/26, 01:52 AM
Kaffu’s “rich man’s toy” line is one of the sharpest things I’ve read on harnesses this year. He’s right about the symptom. I’d push back on one part of the diagnosis.
The bloat drift he names, agent engineering quietly turning into software engineering, is real. Every harness team I’ve talked to hits it around month nine. The framework you started with grows features it didn’t need, the system prompt swells, the retrieval layer doubles, the cost-per-task triples. Codex and Claude Code keep getting better, and you start to wonder what you’re building.
My extension: the drift is structural. It happens because the unit of work in a framework-shaped harness is the whole framework. To add a capability you grow the framework. To change a behaviour you fork the framework. The bloat has nowhere else to go.
When the unit shrinks to one narrow worker, one typed function, one job, drift loses its surface area. A retrieval worker that’s wrong gets replaced, not extended. The math-based reranker Kaffu is right to advocate for becomes a worker that registers rerank::score. The fine-tuned RoBERTa becomes a worker that registers embed::generate. They sit next to the LLM provider worker on the same bus. The system stays cheap by being composable.
Simply put, everything becomes a worker.
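To make the worker idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Bus` class, its `register` and `call` methods, and the toy `overlap_score` and `toy_embed` handlers are hypothetical and are not the API of any real engine. Only the function ids rerank::score and embed::generate come from the post.

```python
from typing import Any, Callable, Dict, List


class Bus:
    """Hypothetical in-process bus: maps a function id to one handler."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[..., Any]] = {}

    def register(self, function_id: str, handler: Callable[..., Any]) -> None:
        # Registering an id that already exists replaces the old worker outright.
        self._handlers[function_id] = handler

    def call(self, function_id: str, **payload: Any) -> Any:
        if function_id not in self._handlers:
            raise KeyError(f"no worker registered for {function_id!r}")
        return self._handlers[function_id](**payload)


def overlap_score(query: str, docs: List[str]) -> List[float]:
    """Toy math-based reranker: fraction of query terms found in each doc."""
    terms = set(query.lower().split())
    if not terms:
        return [0.0 for _ in docs]
    return [len(terms & set(d.lower().split())) / len(terms) for d in docs]


def toy_embed(text: str) -> List[float]:
    """Stand-in for a fine-tuned encoder (assumption: a real one returns a dense vector)."""
    return [float(len(text)), float(len(text.split()))]


bus = Bus()
bus.register("rerank::score", overlap_score)
bus.register("embed::generate", toy_embed)

scores = bus.call(
    "rerank::score",
    query="cheap harness",
    docs=["a cheap small harness", "a monument"],
)
vector = bus.call("embed::generate", text="a b")
```

The design choice worth noticing is that `register` overwrites. A worker that turns out to be wrong is swapped by registering a different handler under the same id, and no caller changes.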
This doesn’t make harnesses economically valuable on its own. Kaffu’s deeper point stands. Most of what teams ship is fancy on paper and useless in production. The framework era encouraged that because the unit it sold was always too big.
I don’t know what the economically valuable harness looks like at steady state. I think it looks small. Small enough that every part is replaceable, debuggable, observable through an observability worker, and benchmarkable against a 100-line fine-tuned alternative. The harness as a slider, not a monument.
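One way to read "benchmarkable" and "observable" is that two workers with the same signature can be scored on the same cases while a wrapper records each call. The sketch below assumes that reading. The names `observed`, `top1_accuracy`, `keyword_rerank`, and `length_rerank` are hypothetical, and the two rerankers are toy stand-ins, not a real fine-tuned model.

```python
import time
from typing import Callable, Dict, List, Tuple

Reranker = Callable[[str, List[str]], List[float]]
Case = Tuple[str, List[str], int]  # (query, docs, index of the doc that should rank first)


def observed(worker: Reranker, log: List[dict]) -> Reranker:
    """Hypothetical observability wrapper: records worker name and latency per call."""
    def wrapper(query: str, docs: List[str]) -> List[float]:
        start = time.perf_counter()
        result = worker(query, docs)
        log.append({"worker": worker.__name__, "seconds": time.perf_counter() - start})
        return result
    return wrapper


def keyword_rerank(query: str, docs: List[str]) -> List[float]:
    """Score each doc by how many query terms it contains."""
    terms = set(query.lower().split())
    return [float(len(terms & set(d.lower().split()))) for d in docs]


def length_rerank(query: str, docs: List[str]) -> List[float]:
    """Deliberately naive baseline: shorter docs score higher."""
    return [-float(len(d)) for d in docs]


def top1_accuracy(worker: Reranker, cases: List[Case]) -> float:
    """Fraction of cases where the highest-scored doc is the expected one."""
    hits = 0
    for query, docs, expected in cases:
        scores = worker(query, docs)
        best = max(range(len(scores)), key=scores.__getitem__)
        hits += int(best == expected)
    return hits / len(cases)


cases: List[Case] = [
    ("cheap harness", ["a monument to frameworks", "a cheap small harness"], 1),
    ("typed function", ["one typed function per worker", "bloat"], 0),
]

log: List[dict] = []
results: Dict[str, float] = {
    w.__name__: top1_accuracy(observed(w, log), cases)
    for w in (keyword_rerank, length_rerank)
}
```

Because both candidates share one signature, the benchmark never needs to know which one is in production. The log is a plain list here; in a bus-style setup it would plausibly be its own worker.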
For the love of the game.