@yibie: 推荐这篇文章，Flask 作者 Armin Ronacher 追踪 Pi 的 bug 发现了一个让人不安的事实：新版 Claude 模型（Opus 4.8、Sonnet 5）的工具调用在退化——不是变好了，是变差了。而且他找到了根因：RL…

X AI KOLs Timeline 2026/07/04 23:03 新闻

claude tool-calling reinforcement-learning agent-harness schema-overfitting degradation anthropic

摘要

Flask 作者 Armin Ronacher 发现新版 Claude 模型（Opus 4.8、Sonnet 5）的工具调用能力退化，根因是 RL 后训练过度适配 Claude Code 的工具 schema，导致替代工具 schema 越来越难以正确生成。文章揭示了模型在特定工具调用场景下性能不升反降的现象，对 agent 开发有重要警示。

推荐这篇文章，Flask 作者 Armin Ronacher 追踪 Pi 的 bug 发现了一个让人不安的事实：新版 Claude 模型（Opus 4.8、Sonnet 5）的工具调用在退化——不是变好了，是变差了。而且他找到了根因：RL 后训练过度适配了 Claude Code 自己的工具 schema，导致替代工具 schema 越来越"离群"。这是所有自己做 agent harness 的人都需要读的文章。更好的模型，更差的工具调用一个奇怪的 Pi issue 让我在过去两天掉进了一个深坑。简短版：新版 Claude 模型有时会在调用 Pi 的 edit 工具时，给嵌套的 edits[] 数组加上多余的、编造出来的字段。不是 Haiku 或什么小模型——是 Opus 4.8。编辑本身通常是正确的，但参数不匹配 schema，因为模型发明了不存在的 keys，Pi 拒绝工具调用并要求重试。这不完全意外——模型偶尔会发出格式不正确的工具调用，特别小的模型。但让我意外的是，这在 Anthropic 的新模型中变得更糟了。Opus 4.8 和 Sonnet 5 都表现出这个问题，而之前的旧模型不这样。换句话说，这个模型家族的 SOTA 模型在某个特定工具 schema 上不如它们的旧兄弟。工具调用就是文本如果你没有花太多时间看 LLM 工具调用的内部机制，需要理解的重要一点是：工具调用不是魔法。模型收到一份转录文本、一个系统 prompt 和一个可用工具列表。服务器把这些搅成一个带有特殊标记 token 的大 prompt。因为模型用那个格式的示例训练和强化过，它在生成过程中某一点会发出被 API 或客户端解释为"用这些参数调用这个工具"的东西。细节是：嵌套数组里面的 JSON 是序列化在 XML 标签里面的。基本顶层字符串参数在线显示，而对象数组通过 JSON 序列化实现。这很重要，因为当模型在一个几百 token 的转义字符串后面要决定 } 还是 ,"..." 时，这正是最高熵的点。失败 Pi 的 edit 工具支持在一个调用中做多个精确字符串替换，所以参数里有一个 edits 数组。在失败的案例里，模型产生了这样的条目：额外加了 requireUnique: true、oldText2、newText2。反复测试中我看到了一整批编造出来的尾随 keys：type、id、kind、unique、requireUnique、matchCase、in_file、forceMatchCount、children、notes、cost，甚至一个 event.0.additionalProperties 在里面。最烦人的是，实际 oldText 和 newText 负载在我检查过的无效调用里是字节级正确的。模型确实产生了正确的调用，然后在对象末尾加了垃圾。这个失败也高度上下文依赖。全新的单轮"编辑这个文件"prompt 完全不会复现。有 agent 历史——模型读过文件、诊断了问题、然后写了多行编辑——就能复现。而且不是所有转录都会这样。打开 strict 工具调用在我的运行中完全消除了问题。为什么在变差我最强的假设是这不是随机退化，而是训练 artifact。旧 Anthropic 模型训练时，它们训练了一些工具，但那个训练还没有 Claude Code 这样用户交付的 harness 作为明显目标。现代 Anthropic 模型大概率不同，因为它们的后训练包括了 Claude Code 或一个看起来非常相似的 harness。模型学到了在那个环境下什么样的工具调用是成功的。它也会学到那个环境容忍什么错误。 Claude Code 自己的工具相对扁平。普通 edit 工具不是 Pi 的嵌套 edits[] 形状，更接近 file_path、old_string、new_string 和一个可选 flag（replace_all）。看 Claude Code 的客户端非常有启发：它包含格式错误工具用的重试路径、参数别名、类型强制转换、Unicode 修复和未知 key 过滤。换句话说，Anthropic 自己的客户端似乎期望和接受相当数量的 slop，并修复它，大部分是静默的。如果强化学习发生在这样的 harness 里，或一个模拟里，那么稍微格式不正确的工具调用仍然可以完成任务并得到奖励。harness 完全吸收了错误，几乎不存在惩罚"发明一个别名"、"加一个多余字段"或"用一个相近的参数名"的梯度。更糟的是，模型可能变得极强地适应了标准 Claude Code edit 工具的形状。一个不同的 harness 可以提供语义相同但 schema 不同的工具。这样的工具会越来越离群。训练得更好的模型可能实际上更难对付你，因为它的先验更强。这不算太意外，但这是一个变迁。Opus 4.5 发布时，它适应其他 edit 工具的能力异常好。我当时相当确信我们在一条好路上——模型只要指令好，更可能适应任何种类的工具形状。现在我有些担心我们在哪条路上。替代工具 schema 可能不只是不熟悉。它们可能被优化特定、宽容的工具生态的后训练隐式惩罚。而且那个生态没有文档。 Slop Harness Claude Code 是闭源的，但我们可以看压缩后的代码。老实说，它对输入数据非常宽容。首先，Claude Code 检查模型可见文本里是否有泄露的 <invoke 标记。然后有自己的状态机来重试坏调用。它有显式 Unicode 转义修复来修坏掉的 \uXXXX 序列和孤立代理。每个工具都有参数别名。比如 Edit 接受 old_str、old_string、new_str/new_string、path 作为 file_path 的别名。它还会静默过滤不认识的 key，也不使用 strict 模式。这对 harness 意味着什么让人不舒服的教训是：工具 schema 不是中性的，至少在 Anthropic 模型上不是。我们喜欢假装 schema 是抽象合约，模型是通用推理器会遵守它，但对某些工具来说这可能不再成立。工具 schema 在分布里的某个位置，有些形状接近模型在后训练中看到的，有些很远。有些对提供商的隐藏编码来说是简单的（比如 ANTML 中的顶层属性），而有些要求模型在长多行字符串后面写大的转义 JSON 对象，嵌套在数组里。模型可能足够聪明理解 schema，但仍然在压力下抽不出那个精确形状。我以前对严格语法约束的工具调用持更多怀疑态度，因为约束解码可能有质量 tradeoff。我仍然认为这普遍成立，但这个 bug 显著改变了我的先验。如果最新的模型在解决问题上变得更好，同时在忠实生成替代工具 schema 上变得更差，那么 harness 需要在某处有更强的保证。原文：Armin Ronacher, "Better Models: Worse Tools", 2026-07-04 https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/… #Claude #Agent #工具调用

查看原文

查看缓存全文

缓存时间: 2026/07/05 10:33

更好的模型，更差的工具调用

一个奇怪的 Pi issue 让我在过去两天掉进了一个深坑。简短版：新版 Claude 模型有时会在调用 Pi 的 edit 工具时，给嵌套的 edits[] 数组加上多余的、编造出来的字段。不是 Haiku 或什么小模型——是 Opus 4.8。编辑本身通常是正确的，但参数不匹配 schema，因为模型发明了不存在的 keys，Pi 拒绝工具调用并要求重试。

这不完全意外——模型偶尔会发出格式不正确的工具调用，特别小的模型。但让我意外的是，这在 Anthropic 的新模型中变得更糟了。Opus 4.8 和 Sonnet 5 都表现出这个问题，而之前的旧模型不这样。换句话说，这个模型家族的 SOTA 模型在某个特定工具 schema 上不如它们的旧兄弟。

工具调用就是文本

如果你没有花太多时间看 LLM 工具调用的内部机制，需要理解的重要一点是：工具调用不是魔法。模型收到一份转录文本、一个系统 prompt 和一个可用工具列表。服务器把这些搅成一个带有特殊标记 token 的大 prompt。因为模型用那个格式的示例训练和强化过，它在生成过程中某一点会发出被 API 或客户端解释为“用这些参数调用这个工具“的东西。

细节是：嵌套数组里面的 JSON 是序列化在 XML 标签里面的。基本顶层字符串参数在线显示，而对象数组通过 JSON 序列化实现。这很重要，因为当模型在一个几百 token 的转义字符串后面要决定 } 还是 ,“…” 时，这正是最高熵的点。

失败

Pi 的 edit 工具支持在一个调用中做多个精确字符串替换，所以参数里有一个 edits 数组。在失败的案例里，模型产生了这样的条目：额外加了 requireUnique: true、oldText2、newText2。反复测试中我看到了一整批编造出来的尾随 keys：type、id、kind、unique、requireUnique、matchCase、in_file、forceMatchCount、children、notes、cost，甚至一个 event.0.additionalProperties 在里面。

最烦人的是，实际 oldText 和 newText 负载在我检查过的无效调用里是字节级正确的。模型确实产生了正确的调用，然后在对象末尾加了垃圾。

这个失败也高度上下文依赖。全新的单轮“编辑这个文件“prompt 完全不会复现。有 agent 历史——模型读过文件、诊断了问题、然后写了多行编辑——就能复现。而且不是所有转录都会这样。打开 strict 工具调用在我的运行中完全消除了问题。

为什么在变差

我最强的假设是这不是随机退化，而是训练 artifact。

旧 Anthropic 模型训练时，它们训练了一些工具，但那个训练还没有 Claude Code 这样用户交付的 harness 作为明显目标。现代 Anthropic 模型大概率不同，因为它们的后训练包括了 Claude Code 或一个看起来非常相似的 harness。模型学到了在那个环境下什么样的工具调用是成功的。它也会学到那个环境容忍什么错误。

Claude Code 自己的工具相对扁平。普通 edit 工具不是 Pi 的嵌套 edits[] 形状，更接近 file_path、old_string、new_string 和一个可选 flag（replace_all）。看 Claude Code 的客户端非常有启发：它包含格式错误工具用的重试路径、参数别名、类型强制转换、Unicode 修复和未知 key 过滤。换句话说，Anthropic 自己的客户端似乎期望和接受相当数量的 slop，并修复它，大部分是静默的。

如果强化学习发生在这样的 harness 里，或一个模拟里，那么稍微格式不正确的工具调用仍然可以完成任务并得到奖励。harness 完全吸收了错误，几乎不存在惩罚“发明一个别名“、“加一个多余字段“或“用一个相近的参数名“的梯度。

更糟的是，模型可能变得极强地适应了标准 Claude Code edit 工具的形状。一个不同的 harness 可以提供语义相同但 schema 不同的工具。这样的工具会越来越离群。训练得更好的模型可能实际上更难对付你，因为它的先验更强。

这不算太意外，但这是一个变迁。Opus 4.5 发布时，它适应其他 edit 工具的能力异常好。我当时相当确信我们在一条好路上——模型只要指令好，更可能适应任何种类的工具形状。现在我有些担心我们在哪条路上。替代工具 schema 可能不只是不熟悉。它们可能被优化特定、宽容的工具生态的后训练隐式惩罚。而且那个生态没有文档。

Slop Harness

Claude Code 是闭源的，但我们可以看压缩后的代码。老实说，它对输入数据非常宽容。

首先，Claude Code 检查模型可见文本里是否有泄露的 <invoke 标记。然后有自己的状态机来重试坏调用。它有显式 Unicode 转义修复来修坏掉的 \uXXXX 序列和孤立代理。每个工具都有参数别名。比如 Edit 接受 old_str、old_string、new_str/new_string、path 作为 file_path 的别名。它还会静默过滤不认识的 key，也不使用 strict 模式。

这对 harness 意味着什么

让人不舒服的教训是：工具 schema 不是中性的，至少在 Anthropic 模型上不是。我们喜欢假装 schema 是抽象合约，模型是通用推理器会遵守它，但对某些工具来说这可能不再成立。

工具 schema 在分布里的某个位置，有些形状接近模型在后训练中看到的，有些很远。有些对提供商的隐藏编码来说是简单的（比如 ANTML 中的顶层属性），而有些要求模型在长多行字符串后面写大的转义 JSON 对象，嵌套在数组里。模型可能足够聪明理解 schema，但仍然在压力下抽不出那个精确形状。

我以前对严格语法约束的工具调用持更多怀疑态度，因为约束解码可能有质量 tradeoff。我仍然认为这普遍成立，但这个 bug 显著改变了我的先验。如果最新的模型在解决问题上变得更好，同时在忠实生成替代工具 schema 上变得更差，那么 harness 需要在某处有更强的保证。

原文：Armin Ronacher, “Better Models: Worse Tools”, 2026-07-04 https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/…

#Claude #Agent #工具调用

Better Models: Worse Tools

Source: https://lucumr.pocoo.org/2026/7/4/better-models-worse-tools/ written on July 04, 2026

A very strangePi issuesent me down a rabbit hole over the last two days. The short version is that newer Claude models sometimes call Pi’s edit tool with extra, invented fields in the nestededits\[\]array. And not Haiku or some small model: Opus 4.8. The edit itself is usually correct but the arguments do not match the schema as the model invents made-up keys and Pi thus rejects the tool call and asks to try again.

That alone is not too surprising as models emit malformed tool calls sometimes. Particularly small ones. What surprised me is that this is getting worse with newer Anthropic models as both Opus 4.8 and Sonnet 5 show it but none of the older models. In other words, the SOTA models of the family are worse at this specific tool schema than their older siblings.

In case you are curious about Fable: I intentionally did not test it because I was not sure if the classifiers they are running might downgrade me to Opus silently.

Tool Calls Are Text

If you have not spent too much time looking at LLM tool calling internals, the important thing to understand is that tool calls are not magic and use some rather crude in-band signalling. The model receives a transcript, a system prompt and a list of available tools. The server munches that into a large prompt with special marker tokens. Because the model was trained and reinforced on examples of that format, at some point during generation it emits something that the API or client interprets as “call this tool with these arguments”.

For a file edit tool, the intended invocation payload might say something like this:

{
  "path": "some/file.py",
  "edits": [
    {
      "oldText": "text to replace",
      "newText": "replacement text"
    }
  ]
}

A harness then validates the arguments, performs the edit, and feeds the result back into the model. If validation fails, the model sees an error and usually tries again.

How exactly that formatting happens is not known for the Anthropic models, but some people have gotten out “ANTML” markers and they at times do leak also into public communications. To the best of my knowledge, the call above would come out serialized like this from the model:

<antml:function_calls>
  <antml:invoke name="edit">
    <antml:parameter name="path">some/file.py</antml:parameter>
    <antml:parameter name="edits">
[
  {
    "oldText": "text to replace",
    "newText": "replacement text"
  }
]
    </antml:parameter>
  </antml:invoke>
</antml:function_calls>

An important thing to note here is that this thing, while looking like XML, is not really XML. It’s just a thing they found convenient to tokenize and train on. The other thing to note is that a basic top-level string parameter appears in-line whereas an array of objects is implemented via JSON serialization. While I’m notentirely surethat this is how it works, there are some indications that this is not too far off. This will become relevant later.

There are two very different ways to make the model produce a structure like this:

You canaskthe model to produce valid JSON matching a schema and then validate it afterwards.
You can constrain the sampler so that invalid JSON, or even invalid schema shapes, cannot be sampled in the first place.

The second approach is what people usually refer to as grammar-aware or constrained decoding. The sampler masks out tokens that would violate the grammar. If the model is currently inside a JSON object and the schema says onlyoldTextandnewTextare allowed, the sampler can prevent it from emitting"in\_file"or"type". Grammar-aware decoding can be used both to constrain something to be syntactically valid JSON and also to enforce specific enum values or keys.

Without any form of constraints the model is merely following a learned convention.

The Failure

Pi’s edit tool supports multiple exact string replacements in one call. That is why the arguments contain aneditsarray. In the failing cases the model produces entries like this:

{
  "oldText": "...",
  "newText": "...",
  "requireUnique": true
}

or this:

{
  "oldText": "...",
  "newText": "...",
  "oldText2": "",
  "newText2": ""
}

Across repeated trials I saw a whole zoo of invented trailing keys:type,id,kind,unique,requireUnique,matchCase,in\_file,forceMatchCount,children,notes,cost,oldText2,newText2,oldText\_2,newText\_2, and even anevent\.0\.additionalPropertieskey inside the edit object itself.

The most annoying part is that the actualoldTextandnewTextpayloads were byte-correct in the invalid calls I inspected. The model had in fact produced the right invocation but then added nonsense at the end of the object.

The failure is also heavily context-dependent. A fresh single-turn prompt like “edit this file” did not reproduce it at all for me. An agentic history where the model had read files, diagnosed a problem and then composed a multi-line edit could reproduce it. And more annoyingly, not all transcripts will show that behavior. In fact, I neededPetr Baudis‘s transcripts to reproduce this for me at all! In that user’s session continuing the session caused Opus 4.8 to fail around 20% of the time. Stripping thinking blocks from history reduced the failure rate by half. Turning on strict tool invocation eliminated it in my runs.

Why It’s Getting Worse

My strongest hypothesis is that this is not random deterioration but a training artifact.

When older Anthropic models were trained, they were trained on some tools (some of which were documented). But that training did not yet have a user-shipped harness like Claude Code as the obvious target. Modern Anthropic models are most likely different because their post-training includes Claude Code or a harness that looks very similar. The model learns what a successful tool call looks like in that environment. It also learns what mistakes are tolerated by that environment.

Claude Code’s own tools are comparatively flat. The ordinary edit tool is not Pi’s nestededits\[\]shape; it is closer tofile\_path,old\_string,new\_string, and an optional flag (replace\_all). Looking at Claude Code’s client is very instructive: it contains retry paths for malformed tool use, parameter aliases, type coercions, Unicode repairs and filtering of unknown keys. In other words, Anthropic’s own client appears to expect and accept a fair amount of slop and repairs it, mostly silently.

If reinforcement learning happens in a harness like that, or a simulation of one, then slightly malformed tool calls can still complete the task and receive reward. The harness fully absorbs the error and there is little gradient against inventing an alias, adding a stray field or using a nearby parameter name.

Worse, the model may become very strongly adapted to the canonical Claude Code edit tool shape. A different harness can present a tool with the same semantic intent but a different schema. Such a tool can increasingly be off-distribution. The better-trained model might actually fight you harder because its prior is stronger.

This is not too surprising, but it is a change from how this was a few months ago. When Opus 4.5 launched, it adapted to other edit tools exceptionally well. In fact, I was pretty convinced that we’re on a good path where the models are more likely to adapt to any sort of tool shape that comes around for as long as the instructions are good.

Now I’m somewhat worried about the track we’re on here. Alternative tool schemas might not just be unfamiliar. They might be implicitly punished by post-training that optimizes for one particular, forgiving tool ecology. And that ecology is not documented. While there is atext editor toolthat is documented, you will see that this format is in fact not followed by Claude Code. What Claude Code does internally (which is a closed-source harness) is hidden from you.

The Slop Harness

Claude Code is obviously closed-source but we can look at the minified code and get some idea of what it does. And honestly, it’s very forgiving of incoming data.

For a start, Claude Code checks the model’s visible text for leaked<invokemarkup. It also emits some telemetry when that happens and then it has its own state machine to retry such bad calls by pushing back to the model.

It has explicit Unicode escape repair which fixes broken\\uXXXXsequences and lone surrogates in string values. It also has per-tool aliases for parameters. For instance,Editacceptsold\_str(presumably from the times when the models were trained on the officially documented text editor tool), the newerold\_stringfrom the schema,new\_str/new\_string,pathas an alias forfile\_path, and some more.

It also silently filters out unexpected keys and it does not usestrictmode either. The issue withstrictmode is that Anthropic applies complexity limits to the tool definitions that cause API requests to fail, so presumably that’s why Claude Code does not attempt to use it.

Strictness

Will this problem be with us in other harnesses too? One huge issue with Anthropic is that the models are completely closed, and so is the harness. Codex models are also closed, but at least the harness is not. We also havegpt-osswhich is at least a bit interesting. The models are explicitly trained to use OpenAI’sharmonyresponse format and there is a lot of documentation that at least tells us how OpenAI people think about this.

Harmony makes channels and tool-call content types part of the prompt format. A function call can look like this:

<|start|>assistant<|channel|>commentary to=functions.get_weather
<|constrain|>json<|message|>{"location":"San Francisco"}<|call|>

The important bit is<\|constrain\|\>json. The model can express in-band that this message body is JSON, and an inference stack can use that boundary to switch into JSON-constrained sampling for the body of the tool call. Presumably a bit of this also happens in Anthropic’s models, at least instrictmode I would imagine.

The marker in harmony helps the sampler to detect when it needs to sample with a specific grammar, and because it is part of the transcript, it makes that rather easy to do. For hosted GPT models, there is also an option to provide aLARKgrammar for custom tools that need to adhere to something like this.

Anthropic appears different from that, though maybe not entirely. If an array of objects is represented as JSON, as it appears to be, then the model has to write JSON inside the tool parameter. There is probably basic grammar-constrained sampling going on, and that may partly explain the extra keys. For a nested array parameter, that JSON includes escaped multi-line file content inside string literals, inside one tag. The unexpected, made-up keys appear exactly at the highest-entropy point of that task: after closing a several-hundred-token escapednewTextstring, where the model must decide\}vs, "\.\.\.".

Opus 4.8 and Sonnet 5 seem to have much stronger priors about what an edit tool call should look like and that prior appears to be Claude Code’s edit schema: a flat old/new string pair, plus the optionalreplace\_allflag. My guess is that Opus has learned that an edit operation may have one extra optional field, but under Pi’s nestedoldText/newTextshape it has no trained name for that field. So it samples a plausible name fresh each time, which is why the failures produce dozens of random keys rather than one stable alias.

Asstrictmode in Anthropic appears to fix this, I presume that on the server side they are refusing to sample a key that is not permitted by the JSON schema structure. That would also explain why they have limits to the complexity of the tool definitions when strict mode is enabled.

So far, the Codex models I tested did not show this type of regression. I tested all available ones except 5.6, which I do not have access to yet.

What This Means For Harnesses

The uncomfortable lesson is that tool schemas are not neutral, at least not on Anthropic models. We like to pretend that a schema is an abstract contract and the model is a general reasoner that will follow it, but that might no longer be the case for some of the tools.

Tool schemas are somewhere in the distribution and some shapes are close to what the model saw during post-training and some are far away. Some are easy for the provider’s hidden encoding (e.g. top-level attributes in ANTML), whereas some require the model to write large escaped JSON objects inside nested arrays after long multiline strings. The model may be smart enough to understand the schema and still be bad at sampling the exact shape under pressure.

If this type of model behavior continues, I wonder what the implications for harnesses are. Obviously one could turn onstrictsampling in Anthropic and the problem should go away. On the other hand, that the model has this behavior shows the impact that reinforcement learning has on them. Fighting that prior is probably futile if you want to get the best model performance.

Right now the reality is that Claude Code is not open source and we cannot really know what they are doing in their RL environments either. We cannot assume Claude-Code-trained behavior will transfer cleanly to your tools unless they are a close match. The more post-training happens inside one dominant harness, the more every other harness will have to inherit its quirks.

I used to be more skeptical of strict grammar-constrained tool invocation because constrained decoding can have quality tradeoffs. I still think that can be true in general, but this bug moved my priors significantly. If the newest models get better at solving the task while getting worse at faithfully emitting an alternative tool schema, then the harness needs stronger guarantees somewhere.

If you want to find out more, or you want to discuss this, consider reading theissue on the Pi tracker.

This entry was taggedaiandpi

copy as/viewmarkdown

相似文章

更好的模型：更差的工具

Hacker News Top

较新的Claude模型（Opus 4.8和Sonnet 5）在工具调用行为上表现更差，它们会在工具调用参数中发明额外的字段，导致验证失败，与旧模型相比是一种倒退。

更好的模型：更差的工具

Simon Willison's Blog

较新的Anthropic模型（如Opus 4.8和Sonnet 5）在使用第三方编辑工具（例如Pi的工具）方面比旧模型更差，这可能是因为它们通过强化学习训练使用Claude Code的内置编辑工具，导致它们在工具调用中发明了额外的字段。

@xiaohu: Claude Code 之父自己的 CLAUDE.md 现在就两行... Claude Code 团队聊"少即是多"分享随着模型能力增加该如何和模型交流： “别跟模型较劲做加法，因为模型每代都在变强，你今天费劲搭的东西很快就白搭了。” 为…

X AI KOLs Timeline

Claude Code 团队分享了使用最佳实践：CLAUDE.md 应尽量简短并定期清空，坚持 CLI 而非 GUI 的原因是模型进步太快，用 AI 修 bug 已非常高效，核心策略是做减法、轻配置、信赖模型能力。

@AlchainHust: https://x.com/AlchainHust/status/2064676532212097418

X AI KOLs Timeline

本文详细评测了Anthropic新发布的Claude Fable 5模型，并展示了作者用其一天内开发Mac App'翻箱'的过程。模型在代码生成和稳定性上有显著提升。

@BohuTANG: 在研发 Evot 过程中发现，要让 Anthropic（Opus 系列）模型发挥到极致，官方 Claude Code 的做法基本是最优解，很难绕过。对 Claude Code prompt 做了深度分析和量化验证，发现他们在训练阶段就把…

X AI KOLs Timeline

在研发Evot过程中发现，要让Anthropic Opus模型发挥极致，官方Claude Code的方法是最优解，因为训练时将Agent Harness行为模式编入了权重，而非纯prompt工程；未来Agent Harness竞争将把行为下沉到模型层。

Better Models: Worse Tools

Tool Calls Are Text

The Failure

Why It’s Getting Worse

The Slop Harness

Strictness

What This Means For Harnesses

相似文章

更好的模型：更差的工具

更好的模型：更差的工具

@xiaohu: Claude Code 之父自己的 CLAUDE.md 现在就两行... Claude Code 团队聊"少即是多"分享随着模型能力增加该如何和模型交流： “别跟模型较劲做加法，因为模型每代都在变强，你今天费劲搭的东西很快就白搭了。” 为…

@AlchainHust: https://x.com/AlchainHust/status/2064676532212097418

@BohuTANG: 在研发 Evot 过程中发现，要让 Anthropic（Opus 系列）模型发挥到极致，官方 Claude Code 的做法基本是最优解，很难绕过。 对 Claude Code prompt 做了深度分析和量化验证，发现他们在训练阶段就把…

提交意见反馈

@BohuTANG: 在研发 Evot 过程中发现，要让 Anthropic（Opus 系列）模型发挥到极致，官方 Claude Code 的做法基本是最优解，很难绕过。对 Claude Code prompt 做了深度分析和量化验证，发现他们在训练阶段就把…