@eliebakouch: to be clear, this is a closed source orchestrator on top of closed source models. if before you didn't control the mode…

X AI KOLs Following 06/22/26, 06:10 AM Products

closed-source orchestrator router multi-agent test-time-scaling sakana-ai critique

Summary

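Elie Bakouch critiques Sakana AI's Fugu system as a closed-source orchestration layer over closed-source models, arguing that it does not deliver "AI sovereignty", that the base Fugu router trails Opus by 10 points on SWE Bench Pro with only slight gains elsewhere, and that the release never reports output tokens or cost for its test-time-scaling results.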

Original Article

Cached at: 06/23/26, 01:43 AM

to be clear, this is a closed source orchestrator on top of closed source models. if before you didn’t control the models, now you don’t even control which ones are used or how much. this is not “AI sovereignty”

i’ve also read the tech report to get an opinion on the technical stuff:

fugu (not the ultra version) is basically a classifier that selects which model at each turn is most likely to answer correctly (in other words a router). this leads to -10 points on SWE Bench pro compared to opus, and gets some gains on other benchmarks, but very slight ones. the argument could be that it reduces cost, but there is no information about this so it’s likely the opposite. they also have an autoresearch benchmark where they compare to frontier models “Model A, B and C”, and it is really crazy to not be transparent about what models you compare against. let’s also say that this probably doesn’t support adding a new llm out of the box, since you need to retrain the classifier
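To make the "router" reading concrete, here is a minimal sketch of per-turn routing. It is not Sakana's implementation: the `Router` class, the keyword weights, and the model names `model_a` and `model_b` are all hypothetical stand-ins for a trained classifier.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Router:
    # Hypothetical per-model keyword weights standing in for a trained classifier.
    weights: Dict[str, Dict[str, float]]

    def scores(self, turn: str) -> Dict[str, float]:
        # Score every known model for this turn by summing matching keyword weights.
        tokens = turn.lower().split()
        return {
            model: sum(w.get(tok, 0.0) for tok in tokens)
            for model, w in self.weights.items()
        }

    def route(self, turn: str) -> str:
        # Pick the model the classifier thinks is most likely to answer correctly.
        s = self.scores(turn)
        return max(s, key=s.get)


router = Router(
    weights={
        "model_a": {"bug": 1.0, "test": 0.5},
        "model_b": {"prove": 1.0, "theorem": 0.8},
    }
)
coding_choice = router.route("fix the bug in this test")
math_choice = router.route("prove this theorem")
```

In this sketch the set of candidate models is fixed by `weights`, so a model the classifier was never trained on can never be selected. That is one way to read the point about needing to retrain before a new llm can be added.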

about fugu ultra, this is basically an advanced plan mode and orchestrator: a model that, for a query, outputs a plan with multiple “workflows”. my understanding of workflows is that they say: “spawn model A subagents to achieve this, then use model B to judge it, then summarize this with model C”, which is just a test time scaling compute strategy. i think this is an ok-ish way to do it, but it’s limited by the fact that they need to predict everything before the agents start working, which is why they limit this to 5 steps. imo you need to predict what to spawn at t+1 with the information you get at t, not with the info you get at t=0.

there are also other issues, such as the fable 5 score on terminal bench being wrong and them being super vague and unclear about which models are in the LLM pool (they only mention closed source api ones)
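The t=0 versus t+1 point can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the step names, the `toy_execute` environment (where the first solve attempt fails and the second succeeds), and the `choose_next` policy. Only the 5-step cap comes from the post.

```python
from typing import Callable, List, Optional

MAX_STEPS = 5  # the 5-step cap mentioned in the post


def toy_execute(step: str, history: List[str]) -> str:
    # Hypothetical environment: the first "solve" fails, any later one succeeds.
    if step == "solve":
        attempts = sum(h.startswith("solve") for h in history)
        return "solve:ok" if attempts >= 1 else "solve:fail"
    if step == "judge":
        passed = bool(history) and history[-1] == "solve:ok"
        return "judge:pass" if passed else "judge:reject"
    return f"{step}:done"


def run_static(plan: List[str]) -> List[str]:
    # The whole plan is fixed at t=0; later steps cannot react to earlier outputs.
    history: List[str] = []
    for step in plan[:MAX_STEPS]:
        history.append(toy_execute(step, history))
    return history


def choose_next(history: List[str]) -> Optional[str]:
    # Decide the step at t+1 from what was observed up to t.
    if not history:
        return "solve"
    last = history[-1]
    if last.startswith("solve"):
        return "judge"
    if last == "judge:reject":
        return "solve"
    if last == "judge:pass":
        return "summarize"
    return None


def run_adaptive(policy: Callable[[List[str]], Optional[str]]) -> List[str]:
    history: List[str] = []
    for _ in range(MAX_STEPS):
        step = policy(history)
        if step is None:
            break
        history.append(toy_execute(step, history))
    return history


static_trace = run_static(["solve", "judge", "summarize"])
adaptive_trace = run_adaptive(choose_next)
```

In this toy setup the static plan goes on to summarize a draft the judge rejected, while the adaptive loop retries after the rejection and only summarizes once the judge passes.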

the biggest and most obvious issue is that they are introducing a “test time scaling” method with “best of N” over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task
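As a rough sketch of the accounting that is missing, the snippet below totals output tokens and cost for a single call versus a best-of-3 over models. The `Call` record, the token counts, and the per-token prices are all made-up placeholders, not figures from the tech report.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Call:
    model: str
    output_tokens: int
    usd_per_million_output_tokens: float  # hypothetical price, not a real quote


def total_output_tokens(calls: List[Call]) -> int:
    return sum(c.output_tokens for c in calls)


def total_cost_usd(calls: List[Call]) -> float:
    return sum(
        c.output_tokens * c.usd_per_million_output_tokens / 1_000_000
        for c in calls
    )


# One model answering once.
single = [Call("model_a", 2000, 10.0)]

# Best-of-3 over models: every candidate is paid for, even if only one is kept.
best_of_3 = [
    Call("model_a", 2000, 10.0),
    Call("model_b", 3000, 20.0),
    Call("model_c", 1000, 5.0),
]

single_tokens = total_output_tokens(single)
single_cost = total_cost_usd(single)
ensemble_tokens = total_output_tokens(best_of_3)
ensemble_cost = total_cost_usd(best_of_3)
```

With these placeholder numbers the ensemble uses three times the output tokens of the single call, which is why a benchmark score alone is hard to interpret without tokens or cost next to it.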

the good comparison here is not with opus but with opus with ultracode/workflows enabled, and not with kimi but with kimi swarm, etc. very very confusing release

Sakana AI (@SakanaAILabs): Introducing Sakana Fugu: A full multi-agent orchestration system accessible via a single model API.

Our ‘Fugu Ultra’ model matches the performance of Fable and Mythos, delivering frontier capability without the risk of export controls.

Try it: https://t.co/aDEFyySWlS 🐡

Similar Articles

@amitiitbhu: https://x.com/amitiitbhu/status/2069023290182758497

@sashimikun_void: @serenaa_ge Deepswe benchmark pls

Sakana Fugu

@DeRonin_: HOLY SH*T, got released Fable-class model in public from Japan by coding and research benchmarks it's literally equival…

@rohanpaul_ai: Sakana Fugu Ultra just beat the other models on visual polish in a live trading-desk coding test, got close to GLM 5.2,…
