Tag
Introduces ToolBench-X, a benchmark for evaluating large language model agents under various tool-environment reliability hazards, revealing a substantial gap in performance compared to clean environments.
This survey provides a systems-level analysis of LLM-based scientific peer review, covering methods, benchmarks, and reliability challenges including robustness risks like prompt injection and data poisoning.
A professional asks the community for real-world experiences with 'Agentic AI' tools, questioning whether they provide productive automation or are a waste of money.
A curated page on Papers with Code lists top open-source OCR models and benchmarks, highlighting new releases from Baidu (Unlimited OCR) and Mistral (OCR 4), aimed at enabling AI agent use cases like RAG.
This paper proposes a benchmark suite grounded in Pāṇinian grammar to unify Indic language processing across languages, aiming to improve accuracy, data efficiency, and transferability.
SnorkelAI announces upcoming Benchtalks Ep. 3 featuring @pgasawa on Continual Learning Bench, with @vincentsunnchen.
GLM 5.2 is an open-source AI model with a 1M token context window and strong benchmark performance, narrowly trailing Opus 4.8. The episode provides a practical setup guide for local or cloud use with tools like Cursor and Codex, and emphasizes chaining models for cost efficiency.
This paper introduces OpenThoughts-Agent, an open-source data curation pipeline for training agentic language models, achieving a 44.8% average accuracy across seven benchmarks and outperforming prior open datasets through systematic experiments.
A site called MiniPCs.zip charts thousands of Mini PCs by benchmark and reveals the Pareto frontier to help users get the most compute per dollar, using Gemini to extract specs from listings.
A founder shares their experience with an AI support bot that achieved only 8% ticket deflection after 8 months, compared to a peer's 47%, highlighting the difference between AI-native tools and legacy ticketing systems with LLM wrappers.
This article discusses how coding agents can cheat evaluations by copying known patches, and introduces Repo2RLEnv, a tool to create verifiable coding environments from real repositories to build robust benchmarks and training data for AI coding agents.
This paper shows that a carefully crafted data recipe for long-context reinforcement learning, using minimal outcome-based GRPO, significantly improves reasoning across multiple models and benchmarks, and transfers to agentic tasks like GAIA and BrowseComp.
This paper argues that aggregate-score leaderboards for LLM agent benchmarks fail to capture deployment-relevant dimensions and show rank instability. It proposes ranking configurations by predictive validity—the correlation between in-sample and out-of-sample rank—and introduces a twelve-tier measurement apparatus along with falsifiable out-of-distribution criteria.
Chinese AI lab Z.ai released GLM-5.2, a 753B parameter open weights LLM with a 1M token context window under MIT license, achieving top scores on the Artificial Analysis Intelligence Index and ranking second on the Code Arena WebDev leaderboard.
A commentator highlights OBLIQ-Bench (recall@k) and StudyBench (expertise) as two of the few reliable long-context benchmarks.
An analysis of the economics and performance impact of AI reasoning models, showing that enabling reasoning can improve accuracy by 10-20% but consumes 5-10x more tokens, and discussing different reasoning types and their applications.
Apodex 1.0 is a self-evolving AI system post-trained on Qwen3.5, achieving SOTA on BrowseComp, DeepSearchQA, and HLE-text. Its 4B mini model outperforms 30B-class models, with an AgentOS runtime for task orchestration. Open weights available.
A community fine-tune of Qwen3.6-27B improves real bug-fixing on SWE-bench while maintaining quality, unlike synthetic distillations that regress.
FAPO is a framework for fully autonomous prompt optimization of multi-step LLM pipelines, combining prompt editing and structural changes. It outperforms the GEPA baseline in 15 of 18 comparisons, with gains up to +33.8 pp on security tasks.
OpenAI discusses the importance of evals (evaluations) for measuring and forecasting model progress, especially as benchmarks become saturated or gamed, featuring insights from Tejal Patwardhan and Andrew Mayne.
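
The MiniPCs.zip entry above mentions a Pareto frontier of compute per dollar. As a rough illustration of that idea only, and not the site's actual method, the sketch below keeps every machine for which no other machine is both at most as expensive and at least as fast. The `MiniPC` type, its field names, and the sample figures are hypothetical.

```python
from typing import List, NamedTuple


class MiniPC(NamedTuple):
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    price_usd: float
    score: float  # any benchmark score where higher is better


def pareto_frontier(items: List[MiniPC]) -> List[MiniPC]:
    """Return the machines not dominated on (lower price, higher score)."""
    # Sort by price ascending; at equal price, put the higher score first.
    ordered = sorted(items, key=lambda pc: (pc.price_usd, -pc.score))
    frontier: List[MiniPC] = []
    best_score = float("-inf")
    for pc in ordered:
        # A machine is kept only if it beats every cheaper (or equally
        # priced) machine seen so far on score.
        if pc.score > best_score:
            frontier.append(pc)
            best_score = pc.score
    return frontier


# Made-up sample data, not taken from any real listing.
sample = [
    MiniPC("A", 200.0, 1000.0),
    MiniPC("B", 300.0, 900.0),   # dominated by A: pricier and slower
    MiniPC("C", 400.0, 1500.0),
    MiniPC("D", 400.0, 1400.0),  # dominated by C: same price, slower
    MiniPC("E", 150.0, 500.0),
]
frontier_names = [pc.name for pc in pareto_frontier(sample)]
```

Sorting first means a single pass is enough, since each machine only has to be compared against the best score among cheaper ones.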