@nick_kango: One more task to add to my twitter benchmark collection:) Btw, Opus 4.8 and all the SOTA models passed when i tried tha…

X AI KOLs Timeline 05/30/26, 06:52 PM News

twitter-benchmark model-evaluation benchmark sota opus claude grok sonnet

Summary

Nick Kang adds a new task to his Twitter benchmark collection and reports that Claude Opus 4.8 and the other SOTA models he tried passed it, while Sonnet 4.6 and Grok 4.3 failed. Alfin (@AlfinCodes) argues that Opus 4.8 is too dangerous to have been released.

One more task to add to my twitter benchmark collection:) Btw, Opus 4.8 and all the SOTA models passed when i tried that, but sonnet 4.6 & Grok 4.3 didn't https://kaggle.com/benchmarks/tasks/nicholaskanggoog/days-with-d-puzzle…



Cached at: 05/31/26, 11:06 AM


Alfin (@AlfinCodes): Claude Opus 4.8 is insane.

Nothing will be the same after this model.

Anthropic should not have released something this dangerous.

Similar Articles

New SOTA: Poetiq uses self-optimizing harness to surpass e.g. Opus 4.7 with Gemini 3 Flash

Reddit r/singularity

Poetiq claims new state-of-the-art coding performance using a self-optimizing harness with Gemini 3 Flash, surpassing Opus 4.7.

@KKaWSB: Moonshot just open-sourced Kimi K2.6—4,000 tool calls in one 12-hour session, 300 sub-agents in parallel building a full codebase. SOTA on SWE-Bench Pro, BrowseComp, HLE and more, ties Claude Opus 4.6 and G…

X AI KOLs Timeline

Moonshot has open-sourced the Kimi K2.6 model, supporting 4,000 tool calls in a single session and 300 parallel sub-agents, achieving SOTA on benchmarks like SWE-Bench Pro and claiming performance on par with Claude Opus 4.6 and GPT-5.4.

@bentossell: wait… if most people think 5.5 is better than 4.7, i assume that’s due to terminal coding benchmark… 4.8 is still outpe…

X AI KOLs Following

The tweet discusses the release of Claude Opus 4.8, which improves on Opus 4.7 with sharper judgment and longer independent work, though it notes that GPT 5.5 still outperforms it on a terminal coding benchmark.

@orca_build: Anthropic’s new Opus 4.8 scores 3.6% lower than GPT 5.5 on Terminal-Bench 2.1… …but it’s noticeably better at UI tasks.…

X AI KOLs Timeline

Anthropic's Opus 4.8 scores 3.6% lower than GPT 5.5 on Terminal-Bench 2.1 but excels at UI tasks; Orca's orchestration enables Codex to delegate UI tasks to Claude Code.

@0xSero: Anyone else notice opus-4.8 is worse than it was on launch? They chopped him.