Claude Opus 4.8 scores over 1% on ARC-AGI 3!!
Summary
Claude Opus 4.8 achieves a score of over 1% on the ARC-AGI 3 benchmark, demonstrating slight progress on a difficult AI reasoning test.
Similar Articles
Claude Opus 4.8 says it's the only model that finished every case on the Super-Agent benchmark. Anyone run it on real agents yet?
Anthropic released Claude Opus 4.8, claiming it is the only model to complete every case on the Super-Agent benchmark and that it outperforms GPT-5.5 on browser/computer use tasks with better tool efficiency and fewer uncorrected code flaws.
Opus 4.8 just broke ARC-AGI-3 (1 minute read)
A new benchmark called LisanBench evaluates LLMs on word chain tasks requiring planning, memory, and constraint adherence, with results showing strong performance from o3 and Anthropic models.
Introducing Claude Opus 4.7
Anthropic has released Claude Opus 4.7, a new AI model featuring significant improvements in advanced software engineering, vision capabilities, and self-verification. The release includes specific cybersecurity safeguards and is available via API and major cloud providers.
@orca_build: Anthropic’s new Opus 4.8 scores 3.6% lower than GPT 5.5 on Terminal-Bench 2.1… …but it’s noticeably better at UI tasks.…
Anthropic's Opus 4.8 scores 3.6% lower than GPT-5.5 on Terminal-Bench 2.1 but excels at UI tasks; Orca's orchestration lets Codex delegate UI tasks to Claude Code.
@mfpiccolo: Opus 4.8 is out. Here is the verdict from @iiidevs lead engineer: did a stress test it’s just another llm can’t rea…
Anthropic released Claude Opus 4.8, an incremental update over Opus 4.7 with sharper judgment and longer autonomous work capability, though some engineers remain skeptical about its code generation without extensive guidance.