terminal-bench

#terminal-bench

@cline: The new Sonnet 5 achieves Opus 4.8 level performance on Terminal-Bench for less than half the cost. Importantly for --y…

X AI KOLs Following ↗ · 13h ago Cached

New Sonnet 5 model achieves Opus 4.8 level performance on Terminal-Bench at less than half the cost, with improved refusal of prompt injection attacks, now available in Cline.

@MiaAI_lab: The Orinth-1.0-35b MoE looks way better than the Qwen 3.6 35b, especially in Terminal-Bench 2.1 & SWE Atlas.

X AI KOLs Timeline ↗ · 5d ago Cached

The Orinth-1.0-35b MoE outperforms Qwen 3.6 35b on Terminal-Bench 2.1 and SWE Atlas benchmarks.

GLM-5.2 matched Claude Opus on 45 terminal-bench coding-agent tasks at less than half the cost (full methodology + failure transcripts inside)

Reddit r/ArtificialInteligence ↗ · 6d ago

GLM-5.2 matches Claude Opus on 45 terminal-bench coding-agent tasks at less than half the cost; 43 of the 45 tasks had identical outcomes.

@cuisitekp: A 9B model outperforms models several times larger. The team behind OLMo/Tülu from Ai2 and the University of Washington released a new paper called Tmax, claiming it's the strongest open-source RL training recipe for 'terminal agents'. Result: A 9B model on Terminal-Be…

X AI KOLs Timeline ↗ · 6d ago Cached

Ai2 and the University of Washington released a paper, Tmax, that claims to be the strongest open-source RL training recipe for terminal agents to date. A 9B-parameter model outperforms larger models on Terminal-Bench 2.0; the key is said to be low-cost generation of large amounts of verifiable training data rather than model size or algorithm.

GLM-5.2 is the first open-weights model to cross 80% on Terminal-Bench and beats every other open model available

Reddit r/LocalLLaMA ↗ · 2026-06-16

GLM-5.2 is the first open-weights model to exceed 80% on Terminal-Bench, surpassing all other open models and even Gemini, making it a frontier-level model at a fraction of the cost.

@ashwingop: https://x.com/ashwingop/status/2065080505113125105

X AI KOLs Timeline ↗ · 2026-06-11 Cached

Sentra's Code Memory system boosts GPT-5.5 to 88.31% on Terminal-Bench 2.1 at a quarter of the cost, outperforming Anthropic's restricted Mythos 5 model. The memory layer reduces input tokens by 52% and costs by 72.6% while improving task success rates.

@LottoLabs: Interesting model here 35b a3b trained for agentic use It gets 60.7 on Terminal Bench2 qwen 3.6 27b gets 59.3 Essential…

X AI KOLs Following ↗ · 2026-06-08 Cached

Nex-AGI releases Nex-N2, an open-source agentic model series (Nex-N2-Pro and Nex-N2-mini) with an Agentic Thinking framework that unifies reasoning, tool use, and environment execution, achieving top-tier performance on agentic and coding benchmarks.

An agent harness written in rust, 100 % self-contained, and topped terminal bench

Reddit r/AI_Agents ↗ · 2026-06-05

Ante is a lightweight, self-contained terminal agent harness written in Rust, designed to be fast and dependency-free. It topped Terminal-Bench 2.0. It is still in preview and not yet open-sourced, but is described as highly responsive to user feedback.

@rohanpaul_ai: Meta paper shows that coding agents get much better when they reuse short summaries of past attempts instead of raw log…

X AI KOLs Following ↗ · 2026-05-23 Cached

A Meta paper shows that coding agents improve significantly when they reuse short summaries of past attempts instead of raw logs, achieving strong gains on SWE-Bench and Terminal-Bench with Claude 4.5 Opus.

Qwen3.6-35B-A3B and 9B are officially on the public Terminal-Bench 2.0 leaderboard!

Reddit r/LocalLLaMA ↗ · 2026-05-16

Qwen3.6-35B-A3B and Qwen3.5-9B are now officially on the Terminal-Bench 2.0 leaderboard. little-coder achieves 24.6% on the 35B variant, surpassing Gemini 2.5 Pro and Qwen3-Coder-480B, and the 9B result shows that sub-10B local models can compete on hard agentic benchmarks.

@samhogan: https://x.com/samhogan/status/2055064462844219603

X AI KOLs Timeline ↗ · 2026-05-14 Cached

HALO uses RLMs to optimize AI agent harnesses by analyzing execution traces and suggesting improvements, achieving gains of more than 10% on several benchmarks, including Terminal-Bench and AppWorld.
