benchmarks

#benchmarks

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arXiv cs.CL ↗ · 5h ago Cached

Introduces ToolBench-X, a benchmark for evaluating large language model agents under various tool-environment reliability hazards, revealing a substantial gap in performance compared to clean environments.

LLM-Based Scientific Peer Review: Methods, Benchmarks, and Reliability Challenges

arXiv cs.CL ↗ · 5h ago Cached

This survey provides a systems-level analysis of LLM-based scientific peer review, covering methods, benchmarks, and reliability challenges including robustness risks like prompt injection and data poisoning.

How are companies evaluating "Agentic AI" tools right now? Are they seeing productive workflow automation results or just a waste of money?

Reddit r/AI_Agents ↗ · 6h ago

A professional asks the community for real-world experiences with 'Agentic AI' tools, questioning whether they provide productive automation or are a waste of money.

Find the best open-source OCR models in one place at Papers with Code [P]

Reddit r/MachineLearning ↗ · 16h ago

A curated page on Papers with Code lists top open-source OCR models and benchmarks, highlighting new releases from Baidu (Unlimited OCR) and Mistral (OCR 4), aimed at enabling AI agent use cases like RAG.

A Pāninian Foundation for Indic Language Processing

arXiv cs.CL ↗ · yesterday Cached

This paper proposes a benchmark suite grounded in Pāṇinian grammar to unify Indic language processing across languages, aiming to improve accuracy, data efficiency, and transferability.

@SnorkelAI: Benchtalks Ep. 3 with @pgasawa (Continual Learning Bench): coming soon with @vincentsunnchen

X AI KOLs Timeline ↗ · yesterday Cached

SnorkelAI announces upcoming Benchtalks Ep. 3 featuring @pgasawa on Continual Learning Bench, with @vincentsunnchen.

@startupideaspod: https://x.com/startupideaspod/status/2069494373604282771

X AI KOLs Timeline ↗ · yesterday Cached

GLM 5.2 is an open-source AI model with a 1M token context window and strong benchmark performance, narrowly trailing Opus 4.8. The episode provides a practical setup guide for local or cloud use with tools like Cursor and Codex, and emphasizes chaining models for cost efficiency.

OpenThoughts-Agent: Data Recipes for Agentic Models

Hugging Face Daily Papers ↗ · 2d ago Cached

This paper introduces OpenThoughts-Agent, an open-source data curation pipeline for training agentic language models, achieving a 44.8% average accuracy across seven benchmarks and outperforming prior open datasets through systematic experiments.

Show HN: MiniPCs.zip – Charting the Pareto frontier of Mini PCs

Hacker News Top ↗ · 4d ago

A site called MiniPCs.zip charts thousands of Mini PCs by benchmark and reveals the Pareto frontier to help users get the most compute per dollar, using Gemini to extract specs from listings.

AI support vendor quoted 40% deflection, called 8% normal after 8 months

Reddit r/artificial ↗ · 6d ago

A founder shares experience with an AI support bot that only achieved 8% ticket deflection after 8 months, compared to a peer's 47%, highlighting the difference between AI-native tools and legacy ticketing systems with LLM wrappers.

@adithya_s_k: https://x.com/adithya_s_k/status/2067628584680710292

X AI KOLs Timeline ↗ · 6d ago Cached

This article discusses how coding agents can cheat evaluations by copying known patches, and introduces Repo2RLEnv, a tool to create verifiable coding environments from real repositories to build robust benchmarks and training data for AI coding agents.

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv cs.CL ↗ · 2026-06-18 Cached

This paper shows that a carefully crafted data recipe for long-context reinforcement learning, using minimal outcome-based GRPO, significantly improves reasoning across multiple models and benchmarks, and transfers to agentic tasks like GAIA and BrowseComp.

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

This paper argues that aggregate-score leaderboards for LLM agent benchmarks fail to capture deployment-relevant dimensions and show rank instability. It proposes ranking configurations by predictive validity, defined as the correlation between in-sample and out-of-sample rank, and introduces a twelve-tier measurement apparatus along with falsifiable out-of-distribution criteria.

GLM-5.2 is probably the most powerful text-only open weights LLM

Simon Willison's Blog ↗ · 2026-06-17 Cached

Chinese AI lab Z.ai released GLM-5.2, a 753B parameter open weights LLM with a 1M token context window under MIT license, achieving top scores on the Artificial Analysis Intelligence Index and ranking second on the Code Arena WebDev leaderboard.

@lateinteraction: At this point in time, two of the extremely few long-context benchmarks I'd assign any weight at all to are OBLIQ-Bench…

X AI KOLs Following ↗ · 2026-06-17

A commentator highlights OBLIQ-Bench (recall@k) and StudyBench (expertise) as two of the few reliable long-context benchmarks.

@cerebras: https://x.com/cerebras/status/2067357992929153268

X AI KOLs Timeline ↗ · 2026-06-17 Cached

An analysis of the economics and performance impact of AI reasoning models, showing that enabling reasoning can improve accuracy by 10-20% but costs 5-10x more tokens, and discussing different reasoning types and their applications.

@heyshrutimishra: Apodex 1.0 dropped and the architecture is genuinely different. It's post-trained on Qwen3.5 as a self-evolving system:…

X AI KOLs Following ↗ · 2026-06-17 Cached

Apodex 1.0 is a self-evolving AI system post-trained on Qwen3.5, achieving SOTA on BrowseComp, DeepSearchQA, and HLE-text. Its 4B mini model outperforms 30B-class models, with an AgentOS runtime for task orchestration. Open weights available.

@witcheer: this is the first Qwen3.6-27B coding tune I've measured that improves real bug-fixing (!!!). - quality (MMLU/ARC/HellaS…

X AI KOLs Timeline ↗ · 2026-06-17 Cached

A community fine-tune of Qwen3.6-27B improves real bug-fixing on SWE-bench while maintaining quality, unlike synthetic distillations that regress.

FAPO: Fully Autonomous Prompt Optimization of Multi-Step LLM Pipelines

Hugging Face Daily Papers ↗ · 2026-06-17 Cached

FAPO is a framework for fully autonomous prompt optimization of multi-step LLM pipelines, combining prompt editing and structural changes. It outperforms the GEPA baseline in 15 of 18 comparisons, with gains up to +33.8 pp on security tasks.

@OpenAI: Let’s talk about evals. We’re always looking for better ways to measure and forecast model progress, especially as benc…

X AI KOLs ↗ · 2026-06-16 Cached

OpenAI discusses the importance of evals (evaluations) for measuring and forecasting model progress, especially as benchmarks become saturated or gamed, featuring insights from Tejal Patwardhan and Andrew Mayne.
