leaderboard

#leaderboard

@arena: Arena Trends: Text-to-Image, Jan 2026 – Apr 2026 For most of the year, @GoogleDeepMind and @OpenAI traded the top spot …

X AI KOLs Following ↗ · 2026-04-21 Cached

GPT-Image-2 surges to 1,512 on the Arena text-to-image leaderboard, opening a 242-point gap over rivals from Google DeepMind and OpenAI.

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

Hugging Face Blog ↗ · 2026-04-21 Cached

QIMMA is a new quality-first Arabic LLM leaderboard introduced by TII UAE that validates benchmarks before evaluation to ensure accurate performance measurement. It addresses systematic quality issues in existing Arabic NLP benchmarks through a rigorous multi-stage validation pipeline.

Opus 4.7 (high) takes #1 on the LLM Debate Benchmark, leading the previous champion, Sonnet 4.6 (high), by 106 BT points. Incredibly, it has not lost a single completed side-swapped matchup: 51 wins, 4 ties, and 0 losses.

Reddit r/singularity ↗ · 2026-04-20

Opus 4.7 (high) has taken the #1 spot on the LLM Debate Benchmark, surpassing the previous champion, Sonnet 4.6 (high), by 106 BT points. It is unbeaten in completed side-swapped matchups, with 51 wins, 4 ties, and 0 losses. The model wins by identifying and controlling the central hinge of each debate, forcing opponents to argue on its terms.

@sumeetrm: LongCoT is adding two new leaderboards! Due to the interest in agents (particularly RLMs), we’re adding a “Restricted H…

X AI KOLs Following ↗ · 2026-04-19 Cached

LongCoT introduces two new agent leaderboards, Restricted Harness and Open Harness, with GPT 5.2 RLM topping the Open Harness leaderboard at 25.12%.

FACTS Grounding: A new benchmark for evaluating the factuality of large language models

Google DeepMind Blog ↗ · 2024-12-17 Cached

DeepMind introduces FACTS Grounding, a comprehensive benchmark with 1,719 examples for evaluating how accurately large language models ground their responses in source material and avoid hallucinations. The benchmark includes a public dataset and an online Kaggle leaderboard tracking LLM performance on factual accuracy and grounding tasks.
