@LM_Braswell: Confirmed LLMs now much better than room of avid Anagram players - can you figure out where to put the last I?
Summary
The author reports that LLMs are now much better than a room of avid Anagram players, and poses a puzzle to readers: where does the last "I" go?
Cached at: 06/10/26, 09:55 PM
Confirmed LLMs now much better than room of avid Anagram players - can you figure out where to put the last I? https://t.co/s1NAMImYP7
Similar Articles
Can LLM Teams Play What? Where? When?
This paper investigates whether team-based interaction improves LLM performance in the quiz game 'What? Where? When?' (ChGK). Using six recent open LLMs on a 2025 dataset of 572 questions, the authors show that team strategies (voting, silent captain, talkative captain) outperform single models by up to 20 percentage points, with the best team achieving 44.23% accuracy, approaching human performance.
Can LLMs Adhere to Strict 2D Spatial Constraints? (Testing with Sokoban)
A benchmark tests LLMs on strict Sokoban puzzles with formatting constraints, finding only ChatGPT, Qwen3.7-max, and Gemini 3.5-thinking succeed, while others fail due to illegal moves or formatting errors.
@stevibe: Which LLMs actually love to think? Tested 7 models on 5 math problems, measured reasoning length. The think winners: bo…
Benchmarked 7 LLMs on 5 math problems; Qwen3.5 27B and 35B A3B generated the longest reasoning chains, exceeding 10k tokens per question.
Age of LLM: A Strategic 1v1 Benchmark for Reasoning, Diplomacy and Reliability of Large Language Models under Fog of War
Introduces Age of LLM, a turn-based 1v1 benchmark where LLMs compete on a grid with fog of war and diplomacy, measuring reasoning, reliability, and strategic planning. Findings show a dominance of nuclear rush tactics and a weak link between reliability and winning.
Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment
This paper investigates whether standard benchmarks underestimate LLM performance by re-evaluating hallucination detection datasets using an LLM-first, human-adjudicated assessment method. The study finds that incorporating LLM reasoning into the adjudication process improves agreement and suggests that model-assisted re-evaluation yields more reliable benchmarks for ambiguity-prone tasks.