I don’t believe this benchmark: a 27B model next to Opus 4.5! Can anyone confirm by testing with a real agentic workflow?

Reddit r/LocalLLaMA Models

Summary

A 27B parameter model reportedly outperforms Opus 4.5 on a benchmark, prompting community skepticism and requests for real-world agentic workflow validation.

