@nick_kango: One more task to add to my twitter benchmark collection:) Btw, Opus 4.8 and all the SOTA models passed when i tried tha…
Summary
Nick Kang adds a new task to his Twitter benchmark collection. When he tried it, Opus 4.8 and the other SOTA models passed, while Sonnet 4.6 and Grok 4.3 did not. Alfin calls Claude Opus 4.8 "insane" and says Anthropic should not have released something this dangerous.
Cached at: 05/31/26, 11:06 AM
One more task to add to my Twitter benchmark collection :) Btw, Opus 4.8 and all the SOTA models passed when I tried that, but Sonnet 4.6 & Grok 4.3 didn’t https://kaggle.com/benchmarks/tasks/nicholaskanggoog/days-with-d-puzzle…
Alfin (@AlfinCodes): Claude Opus 4.8 is insane.
Nothing will be the same after this model.
Anthropic should not have released something this dangerous.