Opus 4.8 just broke ARC-AGI-3 (1 minute read)

TLDR AI 06/02/26, 12:00 AM Tools

benchmark llm-evaluation reasoning language-models word-chain levenshtein forward-planning

Summary

A new benchmark called LisanBench evaluates LLMs on word chain tasks requiring planning, memory, and constraint adherence, with results showing strong performance from o3 and Anthropic models.

It tripled GPT-5.5's score.

Original Article


Cached at: 06/02/26, 03:43 PM

# Thread by @scaling01 on Thread Reader App

Source: [https://threadreaderapp.com/thread/2061513383287882111.html](https://threadreaderapp.com/thread/2061513383287882111.html)

Introducing LisanBench

LisanBench is a simple, scalable, and precise benchmark designed to evaluate large language models on knowledge, forward-planning, constraint adherence, memory and attention, and long context reasoning and "stamina".

"I see possible futures, all at once. Our enemies are all around us, and in so many futures they prevail. But I do see a way, there is a narrow way through." - Paul Atreides

How it works: Models are given a starting English word and must generate the longest possible sequence of valid English words. Each subsequent word in the chain must:

- Differ from the previous word by exactly one letter (Levenshtein distance = 1)
- Be a valid English word
- Not repeat any previously used word

The benchmark repeats this process across multiple starting words of varying difficulty. A model's final score is the cumulative length of its longest valid chains from the starting words.

Results:

- o3 is by far the best model, mainly because it is the only model that manages to escape from parts of the graph with very low connectivity and many dead-ends (slight caveat: o3 was by far the most expensive one to run and used ~30-40k reasoning tokens per starting word)
- Opus 4 and Sonnet 4 with 16k reasoning tokens also perform extremely well, especially Opus, which was able to beat o3 on 3 starting words with only one third of the reasoning tokens!
- Claude 3.7 with thinking takes 4th place, ahead of o1
- other OpenAI reasoning models all perform well, but size does make a difference! o1 is ahead of o4-mini high and o3-mini
- Gemini models perform a bit worse than their Anthropic and OpenAI counterparts, but they have by far the longest outputs - they are a bit delusional and keep yapping; they don't realize when they have made a mistake and stop
- strongest non-reasoning models: Grok-3, GPT-4.5, Sonnet 3.5 and 3.7, Opus 4, Sonnet 4, DeepSeek-V3 and Gemini 1.5 Pro
- Grok 3, Sonnet 3.5 and 3.7 are a surprise!!

Inspiration: LisanBench draws from benchmarks like AidanBench and SOLO-Bench. However, unlike AidanBench, it’s extremely cost-effective, trivially verifiable and doesn't rely on an embedding model - the entire benchmark cost only ~$50 for 57 models. And unlike SOLO-Bench, it explicitly tests knowledge and applies stronger constraints, which makes it more challenging!

Verification: Verification uses the words_alpha.txt dictionary from [github.com/dwyl/english-w…](https://github.com/dwyl/english-words) (~370,105 words), but for scalability, only words from the largest connected component (108,448 words) are used.

Easy Scaling, Difficulty Adjustment & Accuracy improvements:

- Scaling and Accuracy: Just add more starting words or increase the number of trials per word.
- Difficulty: Starting words vary widely - from those with 72 neighbors to those with just 1 - effectively distinguishing between moderately strong and elite models. Difficulty can also be gauged via local connectivity and branching factor.

Why is it challenging? LisanBench uniquely stresses:

- Forward planning: avoiding dead ends by strategic word choices - models must find the narrow way through
- Knowledge: wide vocabulary is essential
- Memory and Attention: previously used words must not be repeated
- Precision: strict adherence to Levenshtein constraints
- Long-context reasoning: coherence and constraint-tracking over hundreds of steps
- Output stamina: some models break early during long generations; LisanBench exposes that, which is critical for agentic use cases

The two beautiful plots below show that the starting words are very different in difficulty. Some are in low-connectivity regions, some in high-connectivity regions and others are just surrounded by dead-ends!

Just as Paul Atreides had to navigate the political, cultural, and metaphysical maze of his destiny, LLMs in LisanBench must explore vast word graphs, searching for the Golden Path - the longest viable chain without collapse.

We will know the chosen model when it appears. It will be the one that finds the Golden Path and avoids every dead end.

Right now, for the most difficult starting word "abysmal", the longest chain found is just 2, although it is also part of the >100k connected component. So there is a narrow way through!

More plots with full leaderboard below!

[![Image](https://threadreaderapp.com/images/1px.png)](https://pbs.twimg.com/media/GsNnQxuW0AAHBlk.png)
[![Image](https://threadreaderapp.com/images/1px.png)](https://pbs.twimg.com/media/GsNnslOW0AAa2KB.jpg)
[![Image](https://threadreaderapp.com/images/1px.png)](https://pbs.twimg.com/media/GsNnzziWYAAQZ4Z.jpg)
[![Image](https://threadreaderapp.com/images/1px.png)](https://pbs.twimg.com/media/GsNrFSWWYAE8o7O.png)

It is worse than AidanBench in one regard. Because it operates on a word / character level and not on a sentence / paragraph level, it is affected by tokenization! So models with better tokenizers, all else being equal, should perform better. And I only tested 10 words; doing 25 or 50 for good measure would probably help with stability.
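
The chain rules and scoring described in the thread can be sketched roughly as follows. This is not the author's verifier, only a minimal illustration under stated assumptions: the function names (`levenshtein_is_one`, `valid_chain_length`, `neighbors`, `total_score`) are hypothetical, "one letter" is read as Levenshtein distance 1 (so a single insertion or deletion also counts), a chain is cut off at its first rule violation, and the starting word itself is not counted toward the length. The thread does not specify those last two details.

```python
from typing import Iterable


def levenshtein_is_one(a: str, b: str) -> bool:
    """True when a and b are exactly one substitution, insertion, or deletion apart."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra character in b at position i


def valid_chain_length(start: str, chain: Iterable[str], dictionary: set[str]) -> int:
    """Number of chain words accepted before the first rule violation (assumption)."""
    used = {start}
    prev = start
    length = 0
    for word in chain:
        word = word.strip().lower()
        if word in used or word not in dictionary or not levenshtein_is_one(prev, word):
            break
        used.add(word)
        prev = word
        length += 1
    return length


def neighbors(word: str, dictionary: set[str]) -> set[str]:
    """Dictionary words one edit away; a brute-force proxy for a word's difficulty."""
    return {w for w in dictionary if levenshtein_is_one(word, w)}


def total_score(results: dict[str, list[list[str]]], dictionary: set[str]) -> int:
    """Sum over starting words of the best valid chain length across trials."""
    return sum(
        max((valid_chain_length(s, c, dictionary) for c in trials), default=0)
        for s, trials in results.items()
    )


if __name__ == "__main__":
    toy_dictionary = {"cat", "cot", "cog", "dog", "cart", "cats"}
    toy_results = {"cat": [["cot", "cog"], ["cot", "cog", "dog"]]}
    print(total_score(toy_results, toy_dictionary))
    print(sorted(neighbors("cat", toy_dictionary)))
```

In a real run the dictionary would presumably be loaded from words_alpha.txt and restricted to the largest connected component, as the thread describes; the toy set here only keeps the example self-contained.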


Similar Articles

Opus 5 benchmarks (30.2% on ARC-AGI3!!!)

Claude Opus 4.8 scores over 1% on ARC-AGI 3 !!

Opus 5 ARC AGI score was benchmaxxed

ARC-AGI Leaderboard

New LLM Coordination Benchmark - Benchmarking Open-Ended Multi-Agent Coordination in Language Agents [R]

