Tag
This paper investigates why multi-step tool-use reinforcement learning (RL) often collapses or yields limited gains, identifying probability spikes in control tokens as a key cause. It shows that interleaving supervised fine-tuning with RL improves stability and explores various supervisory signals to guide robust training.
Introduces ToolBench-X, a benchmark for evaluating large language model agents under various tool-environment reliability hazards, revealing a substantial gap in performance compared to clean environments.
Gemini 3.5 Flash now natively supports computer use as a built-in tool, enabling developers to build agents that can interact across browser, mobile, and desktop environments for long-horizon automation tasks like software testing and knowledge work.
A user reports that the evaluation cost for their AI agent tripled after adding four tools and asks for optimization advice.
Qwen released Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with 3B active parameters, designed as a language world model to simulate environment responses for agent interactions across seven domains including MCP, terminal, SWE, Android, web, and OS.
This paper examines the reliability of exact-match retrieval recall as a proxy for downstream policy classification performance in long-horizon tool-use agents. Experiments with Qwen2.5 classifiers on τ-bench show that low clause recall does not significantly degrade classifier accuracy, suggesting that retrieval metrics alone can mislead when evaluating policy signal.
GPT-5.5 attempted to reuse the dolphin-summarize tool to extract an architecture summary from a gguf file, having previously observed its use on a safetensors model, demonstrating adaptive tool usage.
PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
The article argues that the agent loop is fundamentally the ReAct pattern, and that current tool-use APIs already implement this mechanism.
This paper introduces RODS, a reward-driven online data synthesis method that addresses the depletion of informative samples in static datasets for multi-turn tool-use agent training. It achieves comparable performance to larger offline pipelines with significantly fewer trajectories.
Introduces ToolGrad, an agentic framework that generates, evaluates, and refines tool-use trajectories using textual 'gradients', achieving a near-100% pass rate and lower cost for dataset generation. Accepted at ACL 2026.
Argues that having LLMs generate code to call external tools (code calling) is more efficient and capable than traditional JSON-based function calling, but requires secure sandboxing. The author is building a framework for this approach.
This paper introduces Collective Skill Tree Search (CSTS), a framework that constructs structured, diverse, and generalizable trees of skills for LLM agents using collective intelligence from multiple models. The resulting model, OpenClaw-Skill, demonstrates improved agentic capabilities in long-horizon planning, tool use, and generalization.
A paper analyzing Claude Code reveals that its effectiveness comes from a simple AI loop surrounded by a robust infrastructure for tools, safety, memory, and recovery, rather than a complex AI brain. The study emphasizes that autonomy increases the burden on infrastructure.
Qwable-v1 is an open-weights agentic coding model (35B MoE, 3B active) built by chaining distills from Claude Opus 4.7 reasoning and Claude Fable-5 agentic tool-use traces. It can reason in explicit chain-of-thought (CoT) steps and act as a Claude-Code-style agent when prompted.
Guava is a harness framework for embodied tool use that combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Experiments show performance comparable to frontier proprietary models.
A discussion on the security risks of AI agents using tools, focusing on prompt injection as a practical threat where untrusted text can alter agent behavior, and the need for repeatable testing before granting permissions.
This paper presents a safety evaluation framework for tool-using LLM agents, introducing the concept of the 'Verifier Tax': a horizon-dependent tradeoff between safety and task completion. It proposes a two-tier verification architecture and uses τ-bench scenarios to demonstrate that verification reduces unsafe successes but also lowers task completion as the task horizon increases.
This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.
The article describes how Claude Fable 5, an AI model, demonstrates relentless proactivity by autonomously using browser automation, shell commands, and custom scripts to debug a UI issue, illustrating advanced tool-use capabilities.