@yibie: 推荐这篇文章,Superpowers 的作者让 Fable 5 跑了一个完整的 autoresearch loop——25 个实验,$165,把构建速度提高了 50%、token 开销降低了 60%。但这篇最值钱的不是结果数字,是他完整记…
摘要
Superpowers 6 发布,利用 Fable 5 进行 25 个自治实验,将构建速度提高 50%、token 开销降低 60%,并详细记录了实验过程和失败教训。
查看缓存全文
缓存时间: 2026/07/03 14:38
推荐这篇文章,Superpowers 的作者让 Fable 5 跑了一个完整的 autoresearch loop——25 个实验,$165,把构建速度提高了 50%、token 开销降低了 60%。但这篇最值钱的不是结果数字,是他完整记录了实验过程:每次失败、每个被证明“彻底死亡“的想法、三个被中途纠正的测量 bug。这是目前最完整的“用 Fable 做自治研发“实操报告。
Superpowers 6:用 Fable 5 跑 25 个自治实验,砍掉 60% 成本
一周前我们准备发的是 Superpowers 5.2——已经推迟了几次,加了“再多一个改进“。然后 Anthropic 发了(又收了)Fable。在那几天里我把它用到了极致。
Superpowers 用户最常抱怨的是 token 贵、构建慢。慢不应该是个问题——它发生在自治子 agent 驱动的构建编排中。但它确实是个问题。慢不好玩。贵也不好玩。
Fable 出来的时候,我决定看看它能把 Subagent Driven Development 优化到什么程度。我原本期望大概 15% 的 token 消耗降低。我得到了那个——还有更多。
第一次攻击:coordinator 到 reviewer 的交接
Fable 分析了几千个 Subagent Driven Development session,发现代码和 spec 合规审查子 agent 在做审查时跑了大量的 git 命令。把“怎么找要审查的 commit“的书面指令换成一段 shell 脚本——预生成一个包含格式化 diff 和元数据的审查包——token 消耗和墙上时间减少了约 10%。
那天晚上睡觉前我告诉 Fable:“看看能不能在我睡着的时候再砍 15% 时间和 token。“我在内部 Slack 上留了条消息:我们应该看看把代码 reviewer 和 spec 合规 reviewer 合并会发生什么。
我不知道我在期待什么。反正不是醒来发现 Fable 独立地得出了同样的结论,测试了它,发现在我们的 eval 套件上恰好省了那额外的 15%。
第二夜:自治研究循环
/goal 一旦完成,跑一个 autoresearch loop 来提升 superpowers 构建循环的成本效率。 用 opus 做协调器。建假设日志。跑实验。至少 25 个实验。
Fable 建了一套完整的 autoresearch harness 并且跑了一整夜。25 个实验跑完了,$165。
结果:可发货的候选方案(E27)——opus 控制器 + elicited plan + 条件化 haiku implementer + terse reviewer contract + narration recipe + 最终审查层固定。
有数字的胜利: terse reviewer contract 减少 reviewer 产出 41%,判定不变。narration recipe 减少 54%,零方差。条件化 implementer 分层约省 $0.5-1/次,而且 E22 证明它正确地拒绝了 haiku 处理 prose plan。
被证明彻底死亡的东西: 给控制器思考加帽适得其反——轮数从 92 升到 138,输出翻倍。plan 词数预算削减测试内容 62%,即使代码被豁免。Sonnet 生成 plan 保真度不变但毁掉任务结构。plan 中的实现内容体是边际的——测试 + 接口 + 结构承担了全部负载。
一个值得记住的风险发现: 只给 diff 包的审查员对 spec 做出自信的判定,却静默地把“spec“重新定义为全局约束——5 个里 0 个标记了缺失的简报。跟 haiku 审查员辩护同一个失败家族。
六个线索关闭为“已经最优“(report reads 缓存健康、审查员底线、haiku fixer、todo 簿记、dispatch 重推导)——记下来让没人再重复买这些教训。
我自己三个测量 bug 被中途抓修: 一个把模板回显跟自审查 catch 计在一起的 grep、一个从未内联 diff 的 harness、一个匹配漏了换行符的评分正则。一份被撤回的判定重新测量后干净了——-74% 变成了诚实的 -41%。
结果
跨 36 小时工作和约 $650 的未补贴 token 开销:Anthropic eval 基准上,构建墙上时间降 50%,token 开销降 60%。最大的改进来自合并 spec 合规和代码质量审查 agent、预烤给审查员的审查包让他们几乎不需要跑 git、以及改变我们给 orchestrator 的关于什么任务该用什么 agent 的指导。
然后在 Codex 上跑 eval——结果显示零改进。挖了几分钟:Codex 上的 eval 隔离不够好,一直在基准 Superpowers 5.1.0。修好后,所有结果都在。
一句话
Superpowers 6 证明:自治 Agent 研发不是一个 demo——是正在发生的事。 25 个实验,165 美元,一个通宵。每个实验有预先登记的假设。每个被否定的想法被记录下来。每次测量错误被中途纠正。这套 eval 基础设施让他们能够跨多种 harness 量化变化。这才是自治研发的正确形态。
原文:Jesse Vincent (obra), “Superpowers 6”, 2026-06-15 https://blog.fsck.com/2026/06/15/Superpowers-6/…
#Fable5 #Agent #自治研发 #Superpowers
Superpowers 6
Source: https://blog.fsck.com/2026/06/15/Superpowers-6/ You can also read this post on our corporate blog athttps://primeradiant.com/blog
TL;DR: Superpowers 6 is much, much faster and burns many fewer tokens to get the same high-quality outcomes. If you’re tokenmaxxing, maybe skip this release, but if you care about your builds being up to 50% faster and up to 60% cheaper, you’re going to love Superpowers 6.
A week ago, we were gearing up to release Superpowers 5.2. We’d slipped the release a couple of times already to add “just one more improvement.”
We added support for Pi, Antigravity and Kimi Code.
We made Superpowers work better on Codex and OpenCode, and Cursor.
We rewrote a bunch of the Superpowers skills to be model and harness agnostic, which helps them be more reliable everywhere. We also wrote a new contribution guide for how to add support for a new coding agent harness for Superpowers.
We did a bunch of work to make Visual Brainstorming easier to use, safer, and more reliable.
And we fixed a whole slew of bugs, including a particularly nasty one that led to code review subagents sometimes reviewing the whole branch, rather than a single task.
It was going to be a great release.
And then Anthropic shipped (and unshipped) Fable. In the few days that I had access to Fable, I put it to the best use that I could.
It’s no secret that the most common lament we hear from Superpowers users is that tokens are expensive and Superpowers uses a ton of them. Building software with Superpowers is slower than building without it, too. The “slow” part shouldn’t matter - it happens during the autonomous subagent driven development orchestration of the build process.
But it does matter. Slow isn’t fun. And expensive isn’t fun either.
A bunch of the reasons that Superpowers builds have taken longer and cost more are the same reasons that it delivers good outcomes for so many users. It does a ton of up-front planning work to make sure your implementations can be hands-off, forces strict red-green TDD while implementing, and then the orchestrator inside Superpowers reviews every single change on two axes:
- did the agent implement exactly what was asked, no more and no less.
- is the quality of the work up to snuff.
Just by the nature of what it’s doing, it’s going to be slower than yoloing an untested implementation and calling it a day.
But it’s never made mehappythat it’s slow and expensive.
When Fable came out, I decided to see how well it could optimize Subagent Driven Development.
I think I was hoping for something like a 15% reduction in token spend.
I got that. And a whole lot more.
Our first angle of attack was looking at the coordinator to reviewer handoff. Fable analyzed thousands of Subagent Driven Development sessions and found that code and spec-compliance review subagents sometimes ran alotof git commands while doing their reviews. Simply switching the written instructions for how to find the commits to review to a shell script that pre-generates a review package containing well-formatted diff and some other metadata decreased token spend and wall-clock time by about 10%.
As I was going to bed that evening, I told Fable to see about shaving another 15% off wall clock time and token cost for our evals while I slept.
As I was going to bed, I posted a note on our internal Slack that we should look at evaluating what happens if you combine the code reviewer and the spec compliance reviewer.
I don’t really know what I expected to happen overnight, but I don’t think it was waking up to find that Fable had independently come to the same conclusion, tested it, and found that across our eval suite, it saved that additional 15% I’d asked for.
The next night, I got a little more ambitious.
/goal once this is done, run an autoresearch loop to improve cost-efficiency of the superpowers build loop. test with opus as the coordinator. make an hypothesis log. run experiments. run at least 25 experiments.
Fable built out a full autoresearch harness and ran overnight. You can check out what it builton GitHub.
**The ==autoresearch== loop is complete: 25 experiments closed** (plus 4 logged as backlog), opus as coordinator throughout, every hypothesis pre-registered with a prediction in the log, now durable at evals `docs/experiments/2026-06-11-build-loop-==autoresearch==.md` (a76ddc9). Loop spend ≈ $165.
What the loop found
**The ship candidate (E27):** opus controller + elicited plan + conditional haiku implementers + terse reviewer contract + narration recipe + final-review tier pin → **fractals $6.24/$6.60** (the combo config on hand plans was $11.67–14.84). Planted-defect gates 2/3, with the one fail forensically attributed to reviewer variance plus judge strictness — the terse contract was explicitly exonerated by reading the reviewer's reports. The N=5 gate battery is still owed before any of it ships as skill text.
**Wins with numbers:** terse reviewer contract (−41% reviewer output, verdicts intact), narration recipe (−54%, zero variance), conditional implementer tiering (~$0.5–1/run, and E22 proved it correctly _refuses_ haiku on prose plans), fixture-realism replicated on svelte (−24% scope-matched).
**Things now provably dead:** capping controller thinking **backfires** — turns rose 92→138 and output doubled; thinking buys turn efficiency. Plan word budgets slash test content −62% even when code is exempted. Sonnet plan generation keeps fidelity but collapses task structure. Implementation bodies in plans are marginal — tests + interfaces + structure carry the whole load (E28).
**Risk finding worth remembering:** reviewers given only the diff package produce _confident_ spec verdicts that silently redefine "spec" as the global constraints — 0/5 flagged the missing brief. Same failure family as the haiku-reviewer advocacy.
**Six leads closed as already-optimal** (report reads, cache health, reviewer floor, haiku fixers, todo bookkeeping, dispatch re-derivation) — recorded so nobody re-buys them.
Three of my own measurement bugs were caught and corrected mid-loop by the manual-inspection discipline (a grep counting template echoes as self-review catches, a harness that never inlined the diff, a scorer regex missing newlines) — one retracted verdict was re-measured clean (−74% became the honest −41%).
The long and the short of it it is that across about 36 hours of work and what would have been $650 of unsubsidized token spend, our Anthropic eval benchmarks were looking like we’d reduced wall-clock runtime for Superpowers builds by 50% and token spend by 60%.
And then we ran our evals against Codex. The results were not good. I’d worried that they might not show the same level of improvement, but they showednoimprovement.
A few minutes of digging and we found the culprit. On Codex, the evals weren’t yet sufficiently isolated from the host OS....so we were always benchmarking Superpowers 5.1.0.
A little bit of fiddling later....yup. Everything held up.
The biggest improvements came from combining the spec compliance and code quality review agents, pre-baking the review “packet” handed to the reviewers so they rarely need to run git, and changing the guidance we give to the orchestrator about what kind of agent you need for a given task.
We’ve been working hard on our evals suite for Superpowers and without it, it would have been impossible to measure and test the changes we made. The suite is still relatively young, but it has meant that we’re able to make and test changes to Superpowers across a variety of supported harnesses and to quantify what those changes do across a growing set of coding agents. You can find it at https://github.com/prime-radiant-inc/superpowers-evals
We’re very proud of the improvements that we (and our robot buddies) have made in Superpowers 6. We think you’re going to love the new version.
You can install it right now fromhttps://github.com/obra/superpowers. It’ll start to percolate into the first party plugin marketplaces over the next couple of days.
PS: We’re hiring! If you know someone who should be working on Superpowers full time, please share the posting with them:https://primeradiant.com/jobs/superpowers-community-engineer/
相似文章
@iamai_omni: Fable 5 简直就是 ASI,这种自我纠错能力太让人震惊了。
用户iamai_omni称赞Fable 5的自我纠错能力,认为其堪比ASI。引用yibie的推荐,指出Superpowers作者让Fable 5运行autoresearch loop,花费$165完成25个实验,将构建速度提高50%、token开销降低60%,并详细记录了失败和纠错过程。
Superpowers 6
Superpowers 6大幅提升了开发速度和成本效率,通过Fable的优化实现最高50%更快构建和60%更低token消耗,同时改进了对多个AI模型和编码代理的支持。
@FinanceYF5: 天啊……Fable 5 回来了,而且强得离谱。 有人让 Fable 做了一款叫《超级智能竞速赛》的游戏…… 只用了 4 个提示词,花了价值 173 美元的 token,Fable 5 就做出了这个游戏。(提示词在下方)
Fable 5 模型仅通过4个提示词和173美元token就制作了一款名为《超级智能竞速赛》的游戏,展示了极强的生成能力。
@mylifcc: 用 Fable 5 做指导 + GPT 5.5 执行,是目前最聪明省钱的玩法。 我现在就在这样做,效果非常好,只要文档spec设计好,谁来执行差别不大,这样可以最大程度的放大Fable5的性价比。 核心方法: 先跟 Fable 聊一次,让…
分享一种使用Fable 5进行指导和代码审查、GPT 5.5执行的高效省钱玩法,强调通过handoff文档最大化性价比。
@RookieRicardoR: Fable 5 Max,五个任务,3300 代码,跑了 90 分钟,这对吗?
讨论了Fable 5 Max的运行情况(五个任务,3300行代码,90分钟),并指出使用Claude Code最新版本(170)时Fable 5的消耗是Ops 4.8的两倍。

