Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. It has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.
Summary
Opus 4.7 has taken the #1 spot on the LLM Debate Benchmark, surpassing Sonnet 4.6 by 106 BT points with a perfect record of 51 wins, 4 ties, and 0 losses in side-swapped matchups. The model wins by identifying and controlling the central hinge of debates, forcing opponents onto its terms.
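To give a rough sense of what a 106-point gap could mean, here is a minimal sketch that converts a rating gap into an expected head-to-head score. It assumes "BT" refers to Bradley-Terry ratings reported on an Elo-style scale (base 10, 400 points per factor of ten in odds). The source does not state the benchmark's scaling, so the `scale` parameter and the function name `win_probability` are illustrative assumptions, not the benchmark's method.

```python
def win_probability(rating_gap: float, scale: float = 400.0) -> float:
    """Expected score of the higher-rated model under a logistic
    Bradley-Terry model on an Elo-style scale.

    ASSUMPTION: the 400-point, base-10 scale is the common Elo
    convention; the benchmark may use a different scaling.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


def score_rate(wins: int, ties: int, losses: int) -> float:
    """Fraction of available points earned, counting a tie as half a win."""
    games = wins + ties + losses
    return (wins + 0.5 * ties) / games


if __name__ == "__main__":
    # Gap reported in the post between Opus 4.7 (high) and Sonnet 4.6 (high).
    print(f"expected score at a 106-point gap: {win_probability(106.0):.3f}")
    # Overall record reported in the post: 51 wins, 4 ties, 0 losses.
    print(f"observed score rate: {score_rate(51, 4, 0):.3f}")
```

Under that assumed scale, the gap describes the expected result against the second-ranked model only, while the 51-4-0 record covers matchups against the whole field, so the two figures are not directly comparable.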
Similar Articles
Opus 4.8 Thinking keeps deteriorating on Hard Prompts English in LMArena (again)
Opus 4.8 Thinking continues to deteriorate on the Hard Prompts English benchmark on LMArena, scoring 23 points lower than Opus 4.6 Thinking, which retains the top spot.
HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next!
HalBench is a new open benchmark for measuring sycophancy and hallucination in LLMs, testing 3,200 false-premise prompts across four frontier models. Results show Sonnet 4.6 and Grok 4.3 outperform GPT-5.4 and Gemini 3.1 Pro in honest pushback.
Opus 4.8 just broke ARC-AGI-3 (1 minute read)
A new benchmark called LisanBench evaluates LLMs on word chain tasks requiring planning, memory, and constraint adherence, with results showing strong performance from o3 and Anthropic models.
@stevibe: Which LLMs actually love to think? Tested 7 models on 5 math problems, measured reasoning length. The think winners: bo…
Benchmarked 7 LLMs on 5 math problems; Qwen3.5 27B and 35B A3B generated the longest reasoning chains, exceeding 10k tokens per question.
Opus 4.7 scores lower than 4.6 and 4.5 on SimpleBench
Claude Opus 4.7 shows decreased performance compared to versions 4.6 and 4.5 on SimpleBench evaluation.