Sakana Fugu

Hacker News Top 06/22/26, 02:08 AM Products

multi-agent model-orchestration api sakana-ai llm-coordination research-backed

Summary

Sakana Fugu dynamically orchestrates a diverse pool of top models to tackle complex, multi-step tasks via a single API, leveraging their ICLR 2026 papers on learned orchestration to achieve frontier-level performance without single-vendor dependency.

No content available

Original Article

View Cached Full Text

Cached at: 06/22/26, 04:30 AM

# Sakana Fugu — Multi-Agent System as A Model Source: https://sakana.ai/fugu/ One Model to Command Them All One Model to Command Multi-Agents Frontier-level performance without single-vendor dependency. Fugu dynamically orchestrates the world's best models to tackle complex, multi-step tasks. Plug collective intelligence directly into your workflows today with a single API. Sakana Fugu dynamically orchestrates the world's top models and automatically solves complex, multi-step tasks. Integrate a high-performance API into your workflow. Not yet available in the EU/EEA while we work toward compliance with GDPR and EU-specific regulations. We are working on compliance with GDPR and other EU/EEA-specific regulations, and currently this service is not available in the EU/EEA. ## A Multi-Agent System, Delivered as One Model Providing a multi-agent system as a single model API Sakana Fugu achieves superior performance by dynamically coordinating and orchestrating a diverse pool of powerful models. Instead of using domain knowledge to prescribe team organization, roles, or workflows, Fugu learns to dynamically assemble agents from a pool and coordinate them through non-obvious but highly efficient collaboration patterns. Sakana Fugu achieves high performance by dynamically combining and coordinating a diverse pool of powerful models. It efficiently learns model organization, role assignment, and processing methods that humans would not think of, delivering results. Sakana Fugu architecture overview 01 ### One API to Access All in an Optimized Way Optimize multiple models with a single API Access a coordinated pool of specialized models through one API. Fugu handles model selection and switching for each task, reducing API complexity while improving cost-performance. You can use a pool of specialized models through a single API. Sakana Fugu handles model selection and switching for each task, reducing API complexity while improving cost performance. 02 ### Superior Performance on Complex Tasks Excellent performance on complex tasks Built for coding, reasoning, and other quality-critical workflows, Fugu coordinates expert agents to tackle complex tasks with stronger, more reliable results. Designed for coding, reasoning, and other high-quality workflows, Sakana Fugu coordinates expert agents to deliver stronger, more reliable answers to complex tasks. 03 ### Flexible Agent Selection Flexible agent selection Control which agents can participate in Fugu’s model pool. Opt out of specific providers or models to meet data, privacy, compliance, or organizational requirements. Choose which agents to include in Sakana Fugu’s model pool. You can exclude specific providers or models to meet data, privacy, compliance, or organizational requirements. ## Research-Driven Coordination for Multi-Agent Intelligence Coordination technology based on the latest research supporting multi-agent intelligence Sakana Fugu is grounded in two ICLR 2026 papers on learned model orchestration: TRINITY and the Conductor. Together, they show how systems can learn to assemble, route, and coordinate expert agents for each task instead of relying on hand-designed workflows. For a deeper look at the ideas behind the system, explore our technical report (https://github.com/SakanaAI/fugu/blob/main/Fugu_technical_report.pdf). Sakana Fugu is based on two ICLR 2026 papers on learned model orchestration: TRINITY and the Conductor. These studies show how systems can learn to assemble, route, and coordinate expert agents for each task instead of relying on hand-designed workflows. For details, see the technical report (https://github.com/SakanaAI/fugu/blob/main/Fugu_technical_report.pdf). Cover image for the TRINITY research paper. PAPER (https://arxiv.org/abs/2512.04695) ### TRINITY: An Evolved LLM Coordinator TRINITY: Evolved LLM Coordinator (https://arxiv.org/abs/2512.04695) Trinity uses a lightweight evolved coordinator to orchestrate multiple LLMs over several turns, assigning Thinker, Worker, or Verifier roles to adaptively delegate work across coding, math, reasoning, and knowledge tasks. TRINITY is a mechanism where a lightweight evolved coordinator oversees multiple LLMs over multiple turns, assigning roles such as Thinker, Worker, and Verifier to adaptively distribute tasks across coding, math, reasoning, and knowledge. Cover image for the Conductor research paper. PAPER (https://arxiv.org/abs/2512.04388) ### Learning to Orchestrate Agents in Natural Language with the Conductor Learning to orchestrate agents in natural language with the Conductor (https://arxiv.org/abs/2512.04388) The Conductor is trained with reinforcement learning to discover natural-language coordination strategies, designing agent communication patterns and focused prompts that help diverse LLM pools outperform individual workers on challenging reasoning benchmarks. The Conductor is trained with reinforcement learning to discover natural-language coordination strategies. It designs agent communication patterns and focused prompts, enabling diverse LLM pools to outperform individual workers on challenging reasoning benchmarks. ## Unlock Multi-Agent Intelligence Through An API Unlock multi-agent intelligence through an API Sakana Fugu comes in two models — **Fugu** and **Fugu Ultra** — both available through one OpenAI-compatible API. Pick the model that fits your workload, or switch between them without changing your integration. Sakana Fugu offers two models — **Fugu** and **Fugu Ultra** — both accessible via an OpenAI-compatible API. Choose the model that suits your workload, or switch between them without changing your integration. Fugu Balanced performance and latency Fugu balances strong performance with low latency, making it the ideal default for everyday work. Drop it into tools like Codex for coding and code review, or power responsive chatbot services — all behind a single endpoint. You can also opt specific agents out of its pool to meet data, privacy, and compliance constraints. Sakana Fugu balances strong performance with low latency, making it the ideal default for everyday work. Use it with tools like Codex for coding and code review, or power responsive chatbot services — all through a single endpoint. You can also exclude specific agents from its pool to meet data, privacy, and compliance constraints. Fugu Ultra Optimized for performance Fugu Ultra coordinates a deeper pool of expert agents to maximize answer quality on hard, high-stakes problems. Early users rely on it for Kaggle competitions, paper reproduction, cybersecurity analysis, and literature and patent investigations. Fugu Ultra coordinates a broader pool of expert agents to maximize answer quality on difficult, high-stakes problems. Early users rely on it for Kaggle competitions, paper reproduction, cybersecurity analysis, and literature/patent investigations. ## Quantitative Results Performance of Sakana Fugu: Quantitative Evaluation Our Fugu models surpass publicly accessible frontier models and are shoulder-to-shoulder with Fable 5 and Mythos Preview in various rigorous engineering, scientific, and reasoning benchmarks while delivering frontier capability without the risk of export controls. Our two Fugu models surpass publicly accessible frontier models and are on par with Fable 5 and Mythos Preview in various rigorous engineering, scientific, and reasoning benchmarks, delivering frontier capability without export control risks. Benchmark comparison chart Performance comparison of Fugu models and baseline frontier models across a suite of coding, reasoning, scientific, and agentic benchmarks. For Fable 5 and Mythos Preview, we report the max of the two if both scores are available on the same benchmark. Neither of them is in Fugu’s agent pool as they are not publicly accessible. Performance comparison of Fugu models and baseline frontier models across coding, reasoning, science, and agentic benchmarks. For Fable 5 and Mythos Preview, we report the higher score if both are available on the same benchmark. Neither is included in Fugu’s agent pool as they are not publicly accessible. Highest scores are shown in boldface; second-highest scores are underlined. Benchmark | Fugu | Fugu Ultra | Opus 4.8† | Gemini 3.1 Pro† | GPT 5.5† --- | --- | --- | --- | --- | --- SWE Bench Pro* | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 TerminalBench 2.1 | 80.2 | 82.1 | 74.6 | 70.2 | 78.2 LiveCodeBench | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 LiveCodeBench Pro | 87.8 | 90.8 | 84.8 | 82.9 | 88.4 Humanity’s Last Exam | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 CharXiv Reasoning | 85.1 | 86.6 | 84.2 | 83.4 | 84.1 GPQA-D | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 SciCode | 60.1 | 58.7 | 53.5 | 58.9 | 56.1 τ3 Banking | 21.7 | 20.6 | 20.6 | 8.4 | 20.6 Long Context Reasoning | 74.7 | 73.3 | 67.7 | 72.7 | 74.7 MRCRv2 | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 \* Using mini-swe-agent as scaffold. † Scores reported by model providers. ## Qualitative Results Performance of Sakana Fugu: Qualitative Examples These examples compare Sakana Fugu models with three frontier baselines — **Gemini 3.1 Pro (high)**, **Opus 4.8 (max)**, and **GPT 5.5 (xhigh)**. To keep the focus on behavior rather than brand attribution, the baselines are anonymized as **Model A**, **Model B**, and **Model C** in each description. **The mapping is intentionally not fixed across examples.** These examples compare Sakana Fugu models with three frontier baselines — **Gemini 3.1 Pro (high)**, **Opus 4.8 (max)**, and **GPT 5.5 (xhigh)**. To keep the focus on behavior rather than brand attribution, the baselines are anonymized as **Model A**, **Model B**, and **Model C** in each description. **The mapping is intentionally not fixed across examples.** This experiment shows an AI agent autonomously improving a small GPT's training recipe. Using AutoResearch (Karpathy et al.) – which iteratively edits training code, runs experiments, and keeps only changes that lower validation bits-per-byte (BPB) – the agent ran 123 experiments over ~14 hours on a single H100 GPU. Each line traces a system's best BPB as experiments accumulate: Fugu Ultra is in bold red (solid = mean over three seeds, dashed = best single run), with three frontier-model baselines (Model A, B, and C) faded behind it, and the callouts mark each new improvement the agent found on its own — spanning batch size, model depth, learning rates, and optimizer settings. Fugu Ultra finishes with the best mean BPB (0.9774 ± 0.0019), ahead of Model C (0.9781), Model B (0.9793), and Model A (0.9822), and its best single run reaches 0.9748, leading every baseline. This suggests that orchestrating multiple strong models can outperform any individual frontier model on agentic ML research. **Example 1 — AutoResearch / LLM Training** An experiment where an AI agent autonomously improves a small GPT's training recipe. Using the agentic framework AutoResearch (Karpathy et al.) — which iteratively edits training code, runs experiments, and retains only changes that lower validation bits-per-byte (BPB) — the agent conducted 123 experiments over ~14 hours on a single H100 GPU. Each line traces the best BPB achieved by each system as experiments accumulate. Fugu Ultra is shown in bold red (solid line = mean over three seeds, dashed line = best single run), with three frontier-model baselines (Model A, B, C) faded behind it. Callouts mark each new improvement the agent discovered on its own, covering batch size, model depth, learning rates, and optimizer settings. Fugu Ultra achieves the best mean BPB (0.9774 ± 0.0019), surpassing Model C (0.9781), Model B (0.9793), and Model A (0.9822), and its best single run reaches 0.9748, outperforming all baselines. This suggests that orchestrating multiple strong models can outperform any individual frontier model on agentic ML research. This case study tests whether the reading order of classical Japanese kana letters (仮名消息) can be recovered — letters whose scattered chirashigaki ("scattered-writing") layout makes that genuinely hard even for trained readers of classical Japanese. Each model is given the character bounding boxes together with a rough set of reading-order rules, and writes code that outputs the order the characters should be read in; here it runs on a letter written in 1610 by Hōshun'in (芳春院, 1547–1617), scored by NED (a score based on normalized edit distance from an expert's ground-truth order, where 1.0 is a perfect match). Several frontier models were put through the identical pipeline, but none came close to Fugu Ultra on this letter: Model A reached only NED 0.24 and Model B scored no better, both far below Fugu Ultra's 0.80, while Model C produced no predictor at all. The clip shows the two extremes — each panel draws its predicted path in red over the expert's ground truth in green: Fugu Ultra (top) traces the letter almost exactly, while Model A (bottom) jumps all over the page. (Letter held by the Keio Institute of Oriental Classics.) **Example 2 — Reading Order of Classical Japanese Kana Letters** This case study tests whether the reading order of classical Japanese kana letters (仮名消息) can be recovered — letters written in the "chirashigaki" (scattered writing) style, which makes it genuinely difficult even for trained classical Japanese readers. Each model is given character bounding boxes and a rough set of reading-order rules, and writes code to output the reading order. Here it runs on a letter written in 1610 by Hōshun'in (芳春院, 1547–1617), scored by NED (normalized edit distance from an expert's ground-truth order, where 1.0 is a perfect match). Several frontier models were put through the same pipeline, but none came close to Fugu Ultra on this letter: Model A reached only NED 0.24 and Model B scored no better, both far below Fugu Ultra's 0.80, while Model C produced no valid output at all. The clip shows the two extremes — each panel draws its predicted path in red over the expert's ground truth in green: Fugu Ultra (top) traces the letter almost exactly, while Model A (bottom) jumps all over the page. (Letter held by the Keio Institute of Oriental Classics.) In this benchmark, each of Fugu Ultra and 3 frontier models is given a single prompt to write a Rubik's Cube solver from scratch in pure Python — no off-the-shelf solving libraries allowed — and the resulting program is run locally on a held-out set of 300 randomly scrambled cubes. Solution quality is measured by the number of moves a solution uses, where lower is better. Fugu Ultra and the frontier Model A wrote solvers that ran and solved all 300 cubes, while Model B and Model C each shipped sophisticated-looking code that crashed on execution and returned no valid solution at all (0/300). The clip follows cube #17: from the same scramble, Fugu Ultra's solver reaches the solved state in 19 moves while Model A needs 21 — and across all 300 cubes Fugu Ultra averages 19.72 moves versus 19.76 for Model A, both right at the optimal frontier, with Fugu Ultra never a move longer than Model A on any cube (7 wins, 293 ties, 0 losses). **Example 3 — Rubik's Cube Solver** In this benchmark, each of Fugu Ultra and 3 frontier models is given a single prompt to write a Rubik's Cube solver from scratch in pure Python — no off-the-shelf solver libraries allowed — and the resulting program is run locally on a held-out set of 300 randomly scrambled cubes. Solution quality is measured by the number of moves, where fewer is better. Fugu Ultra and frontier Model A produced solvers that ran and solved all 300 cubes, while Model B and Model C each generated sophisticated-looking code that crashed on execution and returned no valid solution (0/300). The clip follows cube #17: from the same scramble, Fugu Ultra's solver reaches the solved state in 19 moves while Model A needs 21. Across all 300 cubes, Fugu Ultra averages 19.72 moves versus 19.76 for Model A, both at the optimal frontier, with Fugu Ultra never taking more moves than Model A on any cube (7 wins, 293 ties, 0 losses). Task: Create a mechanical iris in CAD, like a camera aperture, where multiple blades move together to open and close the central hole. For each model, we show both the generated detailed CAD itself and a simplified view that makes the structure easier to see. In the CAD generated by Fugu Ultra, the blades rotate around outer pins and clearly open and close the aperture. In contrast, the CAD generated by the other models shows problems such as gaps appearing, weak linkages, or the aperture not closing fully. **Example 4 — CAD Mechanical Iris** Task: Create a mechanical iris in CAD, like a camera aperture, where multiple blades move together to open and close the central hole. For each model, we show both the generated detailed CAD and a simplified view for clarity. In the CAD generated by Fugu Ultra, the blades rotate around outer pins and clearly open and close the aperture. In contrast, the CAD generated by the other models shows issues such as gaps, weak linkages, or the aperture not closing fully. Four blindfold chess games, back to back. Every model plays the same way — no board shown — holding the full game in memory. Fugu outplays four strong opponents: three leading frontier models and a 2100-Elo Stockfish engine, staying accurate where they drift and ending each game in checkmate. **Example 5 — Blindfold Chess** Four blindfold chess games played consecutively. All models play under the same conditions — no board is shown, and the full game is held in memory. Fugu outplays four strong opponents: three leading frontier models and a 2100-Elo Stockfish engine. Fugu remains accurate where the opponents drift and ends each game in checkmate. This benchmark uses a single anonymized equity over one historical 50-week window and is intended to compare sequential, no-look-ahead decision-making rather than to establish generalizable trading performance. Past performance does not guarantee future results, and results may not transfer to other assets, time periods, or live markets. Each model makes online trading decisions on anonymized STOCK_X, using only current and past weekly market data: opening, high, low, and closing prices, volume, returns, moving averages, volatility, drawdown, portfolio state, and prior feedback. Starting with $10,000, the agent chooses whether to buy, hold, or sell, and what fraction of cash or shares to trade. After each action, the next week's price is revealed and the portfolio is updated, so the model must adapt from feedback rather than seeing the future. Across five runs of the identical 50-week pipeline, Fugu Ultra grew the portfolio to $11,943.22 ± $633.86, a +19.43% mean return, while the other frontier models reached fewer than +15% return. **Example 6 — Stock Trading** This benchmark uses a single anonymized equity over one historical 50-week window and is intended to compare sequential, no-look-ahead decision-making rather than to establish generalizable trading performance. Past performance does not guarantee future results, and results may not transfer to other assets, time periods, or live markets. Each model makes online trading decisions on anonymized STOCK_X, using only current and past weekly market data: opening, high, low, and closing prices, volume, returns, moving averages, volatility, drawdown, portfolio state, and prior feedback. Starting with $10,000, the agent chooses whether to buy, hold, or sell, and what fraction of cash or shares to trade. After each action, the next week's price is revealed and the portfolio is updated, so the model must adapt from feedback rather than seeing the future. Across five runs of the identical 50-week pipeline, Fugu Ultra grew the portfolio to $11,943.22 ± $633.86, a +19.43% mean return, while the other frontier models achieved less than +15% return. ## What do our users think about Sakana Fugu? User reviews for Sakana Fugu 01 Software Engineer ##

Sakana Fugu

Similar Articles

@sashimikun_void: @serenaa_ge Deepswe benchmark pls

Sakana Fugu

Sakana Fugu (3 minute read)

@loretoparisi: The LLM Fusion era has just started.

@rohanpaul_ai: Sakana Fugu Ultra just beat the other models on visual polish in a live trading-desk coding test, got close to GLM 5.2,…

Submit Feedback

Similar Articles

@sashimikun_void: @serenaa_ge Deepswe benchmark pls

@loretoparisi: The LLM Fusion era has just started.

@rohanpaul_ai: Sakana Fugu Ultra just beat the other models on visual polish in a live trading-desk coding test, got close to GLM 5.2,…