@yibie: Recommend this article. The author of Superpowers ran a complete autoresearch loop with Fable 5 — 25 experiments, $165, improving build speed by 50% and reducing token costs by 60%. But the most valuable part of this article is not the result numbers; it's the complete record of the process…

X AI KOLs Timeline 07/03/26, 08:48 AM Products

autonomous-research fable-5 superpowers token-optimization agent-development cost-reduction experiment

Summary

Superpowers 6 has been released. Its author used Fable 5 to run 25 autonomous experiments, improving build speed by 50% and cutting token costs by 60%, and documented the experimental process in detail, including the lessons from failed experiments.


Cached at: 07/03/26, 02:38 PM

Recommend this article. The author of Superpowers ran a full autoresearch loop with Fable 5 — 25 experiments, $165, improving build speed by 50% and cutting token costs by 60%. But the most valuable part isn’t the final numbers — it’s the complete record of the experimental process: every failure, every idea that was “proven dead,” and three measurement bugs corrected along the way. This is the most comprehensive hands-on report on “autonomous R&D with Fable” available today.

Superpowers 6: Running 25 Autonomous Experiments with Fable 5, Cutting Costs by 60%

A week ago, we were gearing up to release Superpowers 5.2 — already delayed a few times to add “just one more improvement.” Then Anthropic shipped (and unshipped) Fable. In those few days, I pushed it to its limits.

The most common complaint from Superpowers users is that tokens are expensive and builds are slow. Slow shouldn’t be a problem — it happens during the autonomous subagent-driven orchestration of the build. But it is a problem. Slow isn’t fun. Expensive isn’t fun either.

When Fable came out, I decided to see how much it could optimize Subagent Driven Development. I was hoping for maybe a 15% reduction in token consumption. I got that — and a lot more.

First attack: the coordinator-to-reviewer handoff

Fable analyzed thousands of Subagent Driven Development sessions and found that code and spec compliance review subagents were running a lot of git commands during reviews. Replacing the written instructions for finding the commit to review with a shell script that pre-generates a review package containing a formatted diff and metadata reduced token consumption and wall-clock time by about 10%.
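As a rough illustration of the idea, here is a minimal sketch of pre-generating a review package. The article describes a shell script; this is a Python stand-in, and the names `ReviewPackage`, `collect`, and `render` as well as the package layout are assumptions made for illustration, not the Superpowers implementation.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class ReviewPackage:
    base: str     # commit the work started from
    head: str     # commit under review
    summary: str  # per-file change summary
    diff: str     # full unified diff


def git(*args: str) -> str:
    """Run a git command in the current repository and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def collect(base: str, head: str) -> ReviewPackage:
    """Gather the review material once, so the reviewer need not run git."""
    return ReviewPackage(
        base=base,
        head=head,
        summary=git("diff", "--stat", f"{base}..{head}"),
        diff=git("diff", f"{base}..{head}"),
    )


def render(pkg: ReviewPackage) -> str:
    """Format the package as one text block for the reviewer prompt."""
    return "\n".join([
        f"REVIEW PACKAGE {pkg.base}..{pkg.head}",
        "== Summary ==",
        pkg.summary.strip(),
        "== Diff ==",
        pkg.diff.strip(),
    ])
```

Calling `render(collect(base, head))` inside a repository would produce the text handed to the reviewer. The point of the design is that the git work happens once, deterministically, instead of being rediscovered by each reviewer subagent through tool calls.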

That night before bed, I told Fable: “See if you can cut another 15% in time and tokens while I’m asleep.” I also left a message on internal Slack: we should look at what happens when we merge the code reviewer and spec compliance reviewer.

I didn’t know what I expected. I certainly didn’t expect to wake up and find that Fable had independently reached the same conclusion, tested it, and found it saved that extra 15% on our eval suite.

Second night: the autonomous research loop

/goal once this is done, run an autoresearch loop to improve cost-efficiency of the superpowers build loop.
Use opus as coordinator. Build a hypothesis log. Run experiments. At least 25 experiments.

Fable built a complete autoresearch harness and ran all night. 25 experiments completed for $165.
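The article does not show the harness itself. As a sketch of what one hypothesis-log entry might look like, assuming a lower-is-better metric and a hypothetical 5% threshold for calling something a win (the class, field names, and experiment id below are all invented for illustration):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Experiment:
    exp_id: str               # hypothetical identifier
    hypothesis: str           # written down before the run
    metric: str               # what is measured; lower is better
    baseline: float
    result: Optional[float] = None
    verdict: str = "pending"  # becomes "win" or "dead" once recorded

    def record(self, result: float, min_gain: float = 0.05) -> None:
        """Store the measurement and judge it against the baseline."""
        self.result = result
        change = (result - self.baseline) / self.baseline
        self.verdict = "win" if change <= -min_gain else "dead"

    def change_pct(self) -> float:
        """Relative change versus baseline, in percent."""
        if self.result is None:
            raise ValueError("experiment has not been recorded yet")
        return round(100 * (self.result - self.baseline) / self.baseline, 1)


# The 92 -> 138 turn counts are the ones reported for capping
# controller thinking; everything else here is illustrative.
exp = Experiment(
    exp_id="cap-thinking",
    hypothesis="Capping controller thinking lowers turn count",
    metric="turns",
    baseline=92,
)
exp.record(138)
print(exp.verdict, exp.change_pct())
```

Writing the hypothesis into the entry before the run is what makes a "dead" verdict informative: the rejected idea stays in the log with its numbers attached.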

Result: The shippable candidate (E27) — opus controller + elicited plan + conditional haiku implementer + terse reviewer contract + narration recipe + final review tier pin.

Wins with numbers: the terse reviewer contract reduced reviewer output by 41% with the verdict unchanged. The narration recipe gave a 54% reduction with zero variance. Conditional implementer tiering saved roughly $0.50 to $1 per run, and E22 showed that it correctly refused haiku for prose plans.
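One way to picture conditional implementer tiering is a small router that sends a task to the cheap model only when the plan gives it tests and interfaces to work from. This is a guess at the shape of the rule, not the rule Superpowers uses: `PlanTask`, `pick_implementer`, and the `"default"` fallback tier are hypothetical, since the article does not name the fallback model.

```python
from dataclasses import dataclass, field

CHEAP_TIER = "haiku"       # tier named in the article
FALLBACK_TIER = "default"  # placeholder: the fallback model is not named


@dataclass
class PlanTask:
    description: str
    tests: list[str] = field(default_factory=list)       # test code in the plan
    interfaces: list[str] = field(default_factory=list)  # signatures in the plan


def pick_implementer(task: PlanTask) -> str:
    """Use the cheap tier only when tests and interfaces are both present."""
    if task.tests and task.interfaces:
        return CHEAP_TIER
    # Prose-only tasks are refused by the cheap tier in this sketch.
    return FALLBACK_TIER


structured = PlanTask(
    description="Add retry to fetch()",
    tests=["def test_fetch_retries(): ..."],
    interfaces=["def fetch(url: str, retries: int = 3) -> bytes"],
)
prose_only = PlanTask(description="Make the fetcher more robust")

print(pick_implementer(structured))  # haiku
print(pick_implementer(prose_only))  # default
```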

Things now provably dead: capping controller thinking backfired — turns rose from 92 to 138, output doubled. Plan word budgets slashed test content by 62% even when code was exempted. Sonnet plan generation kept fidelity but destroyed task structure. Implementation bodies in plans are marginal — tests + interface + structure carried the entire load.

A risk finding worth remembering: reviewers given only the diff package made confident spec verdicts while silently redefining “spec” as global constraints — 0 out of 5 flagged the missing brief. Same failure family as the haiku reviewer advocacy.

Six leads closed as already optimal (report reads cache healthy, reviewer floor, haiku fixer, todo bookkeeping, dispatch re-derivation) — recorded so nobody re-buys those lessons.

Three measurement bugs of my own were caught and fixed mid-loop: a grep that counted template echoes as self-review catches, a harness that never inlined the diff, a scorer regex that missed newlines. One retracted verdict was re-measured clean — -74% became an honest -41%.
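The article does not show the scorer, but a regex that "missed newlines" often comes down to the fact that `.` does not match a line break by default. The sample text and pattern below are invented to illustrate that class of bug:

```python
import re

# Hypothetical reviewer output with a reason that wraps onto a second line.
reviewer_output = "VERDICT: fail\nREASON: handler ignores\nthe timeout argument"

# Buggy: "." stops at "\n", so only the first line of the reason is captured.
buggy = re.search(r"REASON: (.*)", reviewer_output).group(1)

# Fixed: re.DOTALL lets "." cross line breaks and capture the whole reason.
fixed = re.search(r"REASON: (.*)", reviewer_output, re.DOTALL).group(1)

print(repr(buggy))  # 'handler ignores'
print(repr(fixed))  # 'handler ignores\nthe timeout argument'
```

A scorer with the buggy pattern would silently undercount anything that spans lines, which is the kind of error that only surfaces when a result looks too good and gets re-measured.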

Results

Across 36 hours of work and about $650 in unsubsidized token spend: on the Anthropic eval benchmark, build wall-clock time down 50%, token spend down 60%. The biggest improvements came from merging the spec compliance and code quality review agents, pre-baking the review package so reviewers rarely need to run git, and changing the guidance we give the orchestrator about what kind of agent to use for what task.

Then we ran the eval on Codex, and the results showed zero improvement. A few minutes of digging showed why: the Codex evals weren’t isolated well enough and were always benchmarking Superpowers 5.1.0. Once that was fixed, all results held.

In a word

Superpowers 6 proves that autonomous agent R&D isn’t a demo — it’s happening. 25 experiments, $165, one overnight run. Every experiment had a pre-registered hypothesis. Every rejected idea was documented. Every measurement error was corrected mid-loop. The eval infrastructure allowed them to quantify changes across multiple harnesses. This is the right shape for autonomous R&D.

Original: Jesse Vincent (obra), “Superpowers 6”, 2026-06-15
https://blog.fsck.com/2026/06/15/Superpowers-6/…

#Fable5 #Agent #AutonomousR&D #Superpowers

Similar Articles

@iamai_omni: Fable 5 is basically ASI, its self-correction ability is astonishing.

Superpowers 6

@FinanceYF5: Oh my god... Fable 5 is back, and it's insanely powerful. Someone asked Fable to make a game called 'Super Smart Racing'... With just 4 prompts and $173 worth of tokens, Fable 5 created this game. (Prompts below)

@RookieRicardoR: Fable 5 Max, five tasks, 3300 lines of code, ran for 90 minutes, is this right?

@mylifcc: Using Fable 5 for guidance + GPT 5.5 for execution is the smartest and most cost-effective approach. I'm doing this right now and the results are excellent. As long as the documentation spec is well-designed, it doesn't matter who executes it, which maximizes Fable 5's cost-effectiveness. Core method: First, chat with Fable once and let it...
