Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.

Reddit r/singularity Models

Summary

Opus 4.7 (high) has taken the #1 spot on the LLM Debate Benchmark, leading Sonnet 4.6 (high) by 106 BT points. It is undefeated in completed side-swapped matchups, with 51 wins, 4 ties, and 0 losses. The model wins by identifying the central hinge of a debate and controlling it, forcing opponents to argue on its terms.
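To give a rough sense of what a 106-point gap could mean, here is a minimal sketch that converts a rating gap into an expected win probability. It assumes that "BT" stands for Bradley-Terry and that ratings sit on an Elo-style scale where 400 points correspond to 10:1 odds. Neither assumption is confirmed by the post, and the function name `bt_win_probability` is hypothetical.

```python
def bt_win_probability(rating_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated model.

    ASSUMPTION: Bradley-Terry ratings on an Elo-style scale, where a gap of
    `scale` points means 10:1 odds. The benchmark's real scale may differ.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


if __name__ == "__main__":
    # Gap reported in the post between the top two models.
    print(f"{bt_win_probability(106):.3f}")
```

Under these assumptions the leader would be expected to win roughly two out of three decisive matchups against the runner-up. A different scale changes that figure, so treat it as illustrative only.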

More info, transcripts, model profiles, and comparisons: [https://github.com/lechmazur/debate](https://github.com/lechmazur/debate)

Models debate the same motion twice with sides swapped. Opus 4.7 often wins by finding the hinge of the debate, dragging the whole exchange back to it, and forcing the other model to defend on its terms. Each completed debate is judged by a three-model panel. Panels exclude judges from the same model family as either debater.
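The protocol described above can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual code: the function names, the plurality rule for a panel, and the rule for combining the two side-swapped debates into a win, tie, or loss are all hypothetical.

```python
from collections import Counter


def eligible_judges(judges: dict[str, str], debater_families: set[str]) -> list[str]:
    """Keep judges whose model family differs from every debater's family."""
    return [name for name, family in judges.items() if family not in debater_families]


def debate_winner(votes: list[str]) -> str:
    """Plurality verdict of one panel; each vote is a model name or 'tie'.

    ASSUMPTION: the benchmark's real aggregation rule is not given in the post.
    """
    ranked = Counter(v for v in votes if v != "tie").most_common(2)
    if not ranked:
        return "tie"
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]


def matchup_result(model: str, first_votes: list[str], second_votes: list[str]) -> str:
    """Combine two side-swapped debates into 'win', 'tie', or 'loss' for `model`."""
    score = 0
    for votes in (first_votes, second_votes):
        winner = debate_winner(votes)
        if winner == model:
            score += 1
        elif winner != "tie":
            score -= 1
    return "win" if score > 0 else "loss" if score < 0 else "tie"
```

For example, a model that wins the first debate 2-1 and loses the side-swapped rematch 1-2 would be recorded as a tie under this rule, which is one plausible reading of how ties arise in the reported 51-4-0 record.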
