Tag
Having previously observed the dolphin-summarize tool used on a safetensors model, GPT-5.5 attempted to reuse it to extract an architecture summary from a GGUF file, demonstrating adaptive tool use.
PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
The article argues that the agent loop is fundamentally the ReAct pattern, and that current tool-use APIs already implement this mechanism.
This paper introduces RODS, a reward-driven online data synthesis method that addresses the depletion of informative samples in static datasets for multi-turn tool-use agent training. It achieves performance comparable to larger offline pipelines with significantly fewer trajectories.
This paper introduces ToolGrad, an agentic framework that generates, evaluates, and refines tool-use trajectories using textual 'gradients', achieving a near-100% pass rate and lower cost for dataset generation. Accepted at ACL 2026.
The author argues that having LLMs generate code to call external tools (code calling) is more efficient and capable than traditional JSON-based function calling, but requires secure sandboxing. The author is building a framework for this approach.
This paper introduces Collective Skill Tree Search (CSTS), a framework that constructs structured, diverse, and generalizable trees of skills for LLM agents using collective intelligence from multiple models. The resulting model, OpenClaw-Skill, demonstrates improved agentic capabilities in long-horizon planning, tool use, and generalization.
A paper analyzing Claude Code reveals that its effectiveness comes from a simple AI loop surrounded by a robust infrastructure for tools, safety, memory, and recovery, rather than a complex AI brain. The study emphasizes that autonomy increases the burden on infrastructure.
Qwable-v1 is an open-weights agentic coding model (35B MoE, 3B active) built by chaining distills from Claude Opus 4.7 reasoning and Claude Fable-5 agentic tool-use traces. It can reason in explicit chains of thought (CoT) and act as a Claude-Code-style agent when prompted.
Guava is a harness framework for embodied tool use that combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Experiments show performance comparable to frontier proprietary models.
A discussion on the security risks of AI agents using tools, focusing on prompt injection as a practical threat where untrusted text can alter agent behavior, and the need for repeatable testing before granting permissions.
This paper presents a safety evaluation framework for tool-using LLM agents, introducing the concept of the 'Verifier Tax'—a horizon-dependent tradeoff between safety and task completion. It proposes a two-tier verification architecture and uses Tau-bench scenarios to demonstrate how verification can reduce unsafe successes but also decrease task completion as task horizon increases.
This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.
The article describes how Claude Fable 5, an AI model, demonstrates relentless proactivity by autonomously using browser automation, shell commands, and custom scripts to debug a UI issue, illustrating advanced tool-use capabilities.
This paper conceptualizes the transition of large language models from conversational chatbots to persistent autonomous AI colleagues, focusing on improved reasoning and tool-augmented task execution with workspace and skill paradigms.
The author describes using OpenAI's Codex model to generate actual Office files (.pptx, .docx, .xlsx) directly via function calls, resulting in a useful end-to-end document generation pattern for AI agents.
This paper introduces IAPO, a reinforcement learning algorithm that improves tool-calling capabilities in multimodal small language models by aligning input attribution with a stronger teacher. Experiments on Qwen2.5-VL-3B show an average 3% improvement in visual question answering accuracy across six test sets.
APPO improves multi-turn tool use in LLM agents by refining branching decisions and credit assignment through fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.
This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based computing environments, enabling scalable, state-based evaluation of LLM-powered agents.
Apodex releases open-weight small models (0.8B, 2B, 4B) specialized for agentic verification tasks, along with the AgentHarness evaluation framework for local agent workflows.