Opus 4.8 Thinking keeps deteriorating on Hard Prompts English in LMArena (again)

Reddit r/singularity 06/07/26, 12:40 AM News

model-performance benchmark hard-prompts opus-thinking lma-arena degradation

Summary

Opus 4.8 Thinking continues to deteriorate on the Hard Prompts English benchmark on LMArena, scoring 23 points lower than Opus 4.6 Thinking, which retains the top spot.

Opus 4.6 Thinking keeps the #1 spot, followed by Opus 4.7 Thinking (-15 points) and then Opus 4.8 Thinking (-23 points compared to 4.6 Thinking). [https://arena.ai/leaderboard/text/hard-prompts-english](https://arena.ai/leaderboard/text/hard-prompts-english) As a non-coder, I find the Hard Prompts (English) benchmark on LMArena to be the one that best matches my experience at work. It's probably more immune to benchmaxxing. SimpleBench also shows that 4.6 is the best model in the Opus family.
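
For a rough sense of what gaps of 15 and 23 points mean in head-to-head votes, here is a minimal sketch. It assumes the leaderboard scores sit on an Elo-style logistic scale (base 10, scale factor 400), which the post does not state, and it ignores ties. The function name `expected_win_rate` is made up for illustration.

```python
def expected_win_rate(rating_gap: float, scale: float = 400.0) -> float:
    """Chance that the higher-rated model wins one head-to-head vote.

    Assumption: Elo-style logistic curve (base 10, scale 400); ties ignored.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# Gaps quoted in the post, both measured against Opus 4.6 Thinking.
for gap in (15, 23):
    print(f"{gap}-point lead: {expected_win_rate(gap):.1%} expected win rate")
```

Under these assumptions a 23-point lead works out to the top model winning about 53% of decisive votes, and a 15-point lead to about 52%, so the ordering reflects a modest edge rather than a large one.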

Similar Articles

Opus 4.7 scores lower than 4.6 and 4.5 on SimpleBench

Reddit r/singularity

Claude Opus 4.7 shows decreased performance compared to versions 4.6 and 4.5 on SimpleBench evaluation.

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.

Reddit r/singularity

Opus 4.7 has taken the #1 spot on the LLM Debate Benchmark, surpassing Sonnet 4.6 by 106 BT points with a perfect record of 51 wins, 4 ties, and 0 losses in side-swapped matchups. The model wins by identifying and controlling the central hinge of debates, forcing opponents onto its terms.

@orca_build: Anthropic’s new Opus 4.8 scores 3.6% lower than GPT 5.5 on Terminal-Bench 2.1… …but it’s noticeably better at UI tasks.…

X AI KOLs Timeline

Anthropic's Opus 4.8 scores 3.6% lower than GPT 5.5 on Terminal-Bench 2.1 but excels at UI tasks; Orca's orchestration enables Codex to delegate UI tasks to Claude Code.

@mfpiccolo: Opus 4.8 is out. Here is the verdict from @iiidevs lead engineer: did a stress test it’s just another llm can’t rea…

X AI KOLs Timeline

Anthropic released Claude Opus 4.8, an incremental update over Opus 4.7 with sharper judgment and longer autonomous work capability, though some engineers remain skeptical about its code generation without extensive guidance.

@datacurve: Opus 4.8 is now on DeepSWE. On the default high thinking effort, it scores 6% higher than Opus 4.7 xhigh, while also lo…

X AI KOLs Following

Opus 4.8 is now available on DeepSWE, scoring 6% higher than Opus 4.7 with reduced average cost per task.