I don't believe this benchmark: a 27B model right next to Opus 4.5! Can anyone confirm by testing it with a real agentic workflow?
Summary
A 27B parameter model reportedly outperforms Opus 4.5 on a benchmark, prompting community skepticism and requests for real-world agentic workflow validation.
Similar Articles
@bcherny: Seeing a number of benchmarks showing Opus is the best model for long-running work. Five tips for running Opus autonomo…
Practical tips for running Anthropic's Claude Opus autonomously for hours or days, such as using auto mode, dynamic workflows, and self-verification; also references the SWE-Marathon benchmark for long-horizon software tasks.
Claude Opus 4.8 says it's the only model that finished every case on the Super-Agent benchmark. Anyone run it on real agents yet?
Anthropic released Claude Opus 4.8, claiming it is the only model to complete every case on the Super-Agent benchmark and that it outperforms GPT-5.5 on browser/computer use tasks with better tool efficiency and fewer uncorrected code flaws.
@VibeMarketer_: life when you discover an open-source model that runs 300 parallel agents, executes for 12+ hours straight, beats GPT-5…
An unnamed open-source model runs 300 parallel agents for 12+ hours and reportedly outperforms GPT-5.4 and Opus 4.6 on several benchmarks, with weights available on Hugging Face.
Opus 4.8 just broke ARC-AGI-3
A new benchmark called LisanBench evaluates LLMs on word chain tasks requiring planning, memory, and constraint adherence, with results showing strong performance from o3 and Anthropic models.
@LottoLabs: Interesting model here 35b a3b trained for agentic use It gets 60.7 on Terminal Bench2 qwen 3.6 27b gets 59.3 Essential…
Nex-AGI releases Nex-N2, an open-source agentic model series (Nex-N2-Pro and Nex-N2-mini) with an Agentic Thinking framework that unifies reasoning, tool use, and environment execution, achieving top-tier performance on agentic and coding benchmarks.