tool-use

Tag

Cards List
#tool-use

Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It

arXiv cs.CL · 20h ago Cached

This paper investigates why multi-step tool-use reinforcement learning (RL) often collapses or yields limited gains, identifying probability spikes in control tokens as a key cause. It shows that interleaving supervised fine-tuning with RL improves stability and explores various supervisory signals to guide robust training.

0 favorites 0 likes
#tool-use

Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arXiv cs.CL · 20h ago Cached

Introduces ToolBench-X, a benchmark for evaluating large language model agents under various tool-environment reliability hazards, revealing a substantial gap in performance compared to clean environments.

0 favorites 0 likes
#tool-use

Introducing computer use in Gemini 3.5 Flash

Google DeepMind Blog · yesterday Cached

Gemini 3.5 Flash now natively supports computer use as a built-in tool, enabling developers to build agents that can interact across browser, mobile, and desktop environments for long-horizon automation tasks like software testing and knowledge work.

0 favorites 0 likes
#tool-use

Testmu eval cost jumped 3x after we added 4 tools to our agent. Anyone optimize this?

Reddit r/AI_Agents · yesterday

A user reports that the evaluation cost for their AI agent tripled after adding four tools, seeking optimization advice.

0 favorites 0 likes
#tool-use

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments

Reddit r/LocalLLaMA · yesterday

Qwen released Qwen-AgentWorld-35B-A3B, a 35B-parameter MoE model with 3B active parameters, designed as a language world model to simulate environment responses for agent interactions across seven domains including MCP, terminal, SWE, Android, web, and OS.

0 favorites 0 likes
#tool-use

When Retrieval Metrics Mislead: Measuring Policy Signal in Long-Horizon Tool-Use Agents

arXiv cs.CL · yesterday Cached

This paper examines the reliability of exact-match retrieval recall as a proxy for downstream policy classification performance in long-horizon tool-use agents. Experiments with Qwen2.5 classifiers on τ-bench show that low clause recall does not significantly degrade classifier accuracy, suggesting that retrieval metrics alone can mislead when evaluating policy signal.

0 favorites 0 likes
#tool-use

@QuixiAI: I saw something really interesting today. GPT-5.5 saw me use `dolphin-summarize` once, to get the architecture summary …

X AI KOLs Following · 3d ago Cached

GPT-5.5 attempted to reuse the dolphin-summarize tool to extract an architecture summary from a gguf file, having previously observed its use on a safetensors model, demonstrating adaptive tool usage.

0 favorites 0 likes
#tool-use

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Hugging Face Daily Papers · 5d ago Cached

PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.

0 favorites 0 likes
#tool-use

The agent loop is just ReAct, and your tool-use API already implements it

Reddit r/AI_Agents · 5d ago

The article argues that the agent loop is fundamentally the ReAct pattern, and that current tool-use APIs already implement this mechanism.

0 favorites 0 likes
#tool-use

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

arXiv cs.AI · 2026-06-18 Cached

This paper introduces RODS, a reward-driven online data synthesis method that addresses the depletion of informative samples in static datasets for multi-turn tool-use agent training. It achieves comparable performance to larger offline pipelines with significantly fewer trajectories.

0 favorites 0 likes
#tool-use

@Zhongyi_Zhou_: ML optimizes via mathematical gradients; Loop Engineering needs textual "gradients"! Introducing ToolGrad: an agentic f…

X AI KOLs Timeline · 2026-06-17 Cached

Introduces ToolGrad, an agentic framework that generates, evaluates, and refines tool-use trajectories using textual 'gradients', achieving near 100% pass rate and lower cost for dataset generation. Accepted at ACL 2026.

0 favorites 0 likes
#tool-use

Code calling is all you need

Reddit r/AI_Agents · 2026-06-16

Argues that using code generated by LLMs to call external tools (code calling) is more efficient and capable than traditional JSON-based function calling, but requires secure sandboxing. The author is building a framework for this approach.

0 favorites 0 likes
#tool-use

@omarsar0: // OpenClaw-Skill: Searching a Tree of Agent Skills // If you build reusable skill libraries for your agents, this one …

X AI KOLs Following · 2026-06-16 Cached

This paper introduces Collective Skill Tree Search (CSTS), a framework that constructs structured, diverse, and generalizable trees of skills for LLM agents using collective intelligence from multiple models. The resulting model, OpenClaw-Skill, demonstrates improved agentic capabilities in long-horizon planning, tool use, and generalization.

0 favorites 0 likes
#tool-use

@rohanpaul_ai: The paper is saying that Claude Code works well not because it has a complex AI brain, but because a simple AI loop is …

X AI KOLs Following · 2026-06-16 Cached

A paper analyzing Claude Code reveals that its effectiveness comes from a simple AI loop surrounded by a robust infrastructure for tools, safety, memory, and recovery, rather than a complex AI brain. The study emphasizes that autonomy increases the burden on infrastructure.

0 favorites 0 likes
#tool-use

Claude Fable 5 distilled

Reddit r/LocalLLaMA · 2026-06-16 Cached

Qwable-v1 is an open-weights agentic coding model (35B MoE, 3B active) built by chaining distills from Claude Opus 4.7 reasoning and Claude Fable-5 agentic tool-use traces. It can think in explicit CoT chains and act as a Claude-Code-style agent when prompted.

0 favorites 0 likes
#tool-use

Guava: An Effective and Universal Harness for Embodied Manipulation

Hugging Face Daily Papers · 2026-06-16 Cached

Guava is a harness framework for embodied tool use that combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Experiments show performance comparable to frontier proprietary models.

0 favorites 0 likes
#tool-use

For tool-using agents, where do you draw the security boundary?

Reddit r/AI_Agents · 2026-06-14

A discussion on the security risks of AI agents using tools, focusing on prompt injection as a practical threat where untrusted text can alter agent behavior, and the need for repeatable testing before granting permissions.

0 favorites 0 likes
#tool-use

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

Reddit r/MachineLearning · 2026-06-14

This paper presents a safety evaluation framework for tool-using LLM agents, introducing the concept of the 'Verifier Tax'—a horizon-dependent tradeoff between safety and task completion. It proposes a two-tier verification architecture and uses Tau-bench scenarios to demonstrate how verification can reduce unsafe successes but also decrease task completion as task horizon increases.

0 favorites 0 likes
#tool-use

Strategic Decision Support for AI Agents

arXiv cs.AI · 2026-06-12 Cached

This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.

0 favorites 0 likes
#tool-use

Claude Fable is relentlessly proactive

Hacker News Top · 2026-06-12 Cached

The article describes how Claude Fable 5, an AI model, demonstrates relentless proactivity by autonomously using browser automation, shell commands, and custom scripts to debug a UI issue, illustrating advanced tool-use capabilities.

0 favorites 0 likes
Next →
← Back to home

Submit Feedback