A new benchmark called LisanBench evaluates LLMs on word-chain tasks that require planning, memory, and constraint adherence. In the reported results, o3 and Anthropic models perform strongly.