Opus 4.8 Thinking keeps deteriorating on Hard Prompts English in LMArena (again)

Reddit r/singularity News

Summary

Opus 4.8 Thinking continues to deteriorate on the Hard Prompts English benchmark on LMArena, scoring 23 points lower than Opus 4.6 Thinking, which retains the top spot.

Opus 4.6 Thinking keeps the #1 spot, followed by Opus 4.7 Thinking (-15 points) and, lastly, Opus 4.8 Thinking (-23 points compared to 4.6 Thinking). [https://arena.ai/leaderboard/text/hard-prompts-english](https://arena.ai/leaderboard/text/hard-prompts-english) As a non-coder, I find the Hard Prompts (English) benchmark on LMArena to be the one that best matches my experience at work. It's probably more immune to benchmaxxing. Simple Bench also shows that 4.6 is the best model in the Opus family.

Similar Articles

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.

Reddit r/singularity

Opus 4.7 has taken the #1 spot on the LLM Debate Benchmark, surpassing Sonnet 4.6 by 106 BT points with a perfect record of 51 wins, 4 ties, and 0 losses in side-swapped matchups. The model wins by identifying and controlling the central hinge of debates, forcing opponents onto its terms.