Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.

Reddit r/singularity 04/20/26, 09:53 PM Models

llm-benchmark debate evaluation claude leaderboard ai-competition

Summary

Opus 4.7 has taken the #1 spot on the LLM Debate Benchmark, surpassing Sonnet 4.6 by 106 BT points with an undefeated record of 51 wins, 4 ties, and 0 losses in completed side-swapped matchups. The model wins by identifying and controlling the central hinge of a debate, forcing opponents onto its terms.

More info, transcripts, model profiles, and comparisons: [https://github.com/lechmazur/debate](https://github.com/lechmazur/debate)

Models debate the same motion twice with sides swapped. Opus 4.7 often wins by finding the hinge of the debate, dragging the whole exchange back to it, and forcing the other model to defend on its terms. Each completed debate is judged by a three-model panel, and panels avoid judges from the same model family as either debater.
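To make the setup concrete, here is a rough Python sketch of how a side-swapped matchup and a same-family judge filter could work. It is not the benchmark's actual code. The function names, the vote labels (`"A"`, `"B"`, `"tie"`), the rule of comparing A votes against B votes within a panel, and the way the two debates are combined into a win/tie/loss are all assumptions made for illustration; the repo linked above is the place to check the real scoring.

```python
from collections import Counter


def panel_verdict(votes):
    """Reduce one panel's votes to 'A', 'B', or 'tie'.

    Votes are keyed by model (A or B), not by debate side, so the
    side swap is already accounted for. Assumed rule: whichever
    model has more votes wins the debate; equal counts give a tie.
    """
    counts = Counter(votes)
    if counts["A"] > counts["B"]:
        return "A"
    if counts["B"] > counts["A"]:
        return "B"
    return "tie"


def matchup_result(first_debate_votes, second_debate_votes):
    """Combine two side-swapped debates into a result for model A.

    Assumed rule: +1 per debate won, -1 per debate lost, and the
    sign of the total decides win / tie / loss.
    """
    points = {"A": 1, "B": -1, "tie": 0}
    score = sum(
        points[panel_verdict(votes)]
        for votes in (first_debate_votes, second_debate_votes)
    )
    if score > 0:
        return "win"
    if score < 0:
        return "loss"
    return "tie"


def eligible_judges(candidates, debaters, family_of):
    """Drop candidate judges that share a model family with a debater."""
    blocked = {family_of[d] for d in debaters}
    return [j for j in candidates if family_of[j] not in blocked]
```

Under these assumptions, a model that wins one debate and ties the other wins the matchup, while a 1-1 split across the two sides counts as a tie.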

