tool-use

#tool-use

@QuixiAI: I saw something really interesting today. GPT-5.5 saw me use `dolphin-summarize` once, to get the architecture summary …

X AI KOLs Following ↗ · yesterday

After observing a single use of the `dolphin-summarize` tool on a safetensors model, GPT-5.5 attempted to reuse it to extract an architecture summary from a GGUF file, an example of adaptive tool use.

#tool-use

PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Hugging Face Daily Papers ↗ · 3d ago

PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.

#tool-use

The agent loop is just ReAct, and your tool-use API already implements it

Reddit r/AI_Agents ↗ · 3d ago

The article argues that the agent loop is fundamentally the ReAct pattern, and that current tool-use APIs already implement this mechanism.
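
To make the claim concrete, here is a minimal sketch of a ReAct-style loop (reason, act, observe, repeat). It is an illustration under stated assumptions, not code from the linked post: `scripted_model`, `TOOLS`, and `run_agent` are hypothetical names, and the model is a scripted stand-in rather than a real tool-use API.

```python
from typing import Callable

# Hypothetical tool registry: each tool takes a string argument and returns a string.
TOOLS: dict[str, Callable[[str], str]] = {
    "add": lambda arg: str(sum(int(x) for x in arg.split(","))),
}

def scripted_model(history: list[str]) -> dict:
    """Stand-in for an LLM call (assumption): asks for one tool, then answers."""
    observations = [h for h in history if h.startswith("observation: ")]
    if not observations:
        return {"type": "tool_call", "name": "add", "arguments": "2,3"}
    return {"type": "final", "content": observations[-1].split(": ", 1)[1]}

def run_agent(task: str, model=scripted_model, max_steps: int = 5) -> str:
    history = [f"task: {task}"]
    for _ in range(max_steps):
        step = model(history)                            # reason: model picks the next step
        if step["type"] == "final":
            return step["content"]                       # model decided it is done
        result = TOOLS[step["name"]](step["arguments"])  # act: run the requested tool
        history.append(f"observation: {result}")         # observe: feed the result back
    raise RuntimeError("step budget exhausted")

answer = run_agent("what is 2 + 3?")
```

A real tool-use API would replace `scripted_model` with a model request, but the control flow (call the model, execute any requested tool, append the result, repeat until a final answer) keeps the same shape.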

#tool-use

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

arXiv cs.AI ↗ · 5d ago

This paper introduces RODS, a reward-driven online data synthesis method that addresses the depletion of informative samples in static datasets for multi-turn tool-use agent training. It achieves comparable performance to larger offline pipelines with significantly fewer trajectories.

#tool-use

@Zhongyi_Zhou_: ML optimizes via mathematical gradients; Loop Engineering needs textual "gradients"! Introducing ToolGrad: an agentic f…

X AI KOLs Timeline ↗ · 6d ago

The post introduces ToolGrad, an agentic framework that generates, evaluates, and refines tool-use trajectories using textual 'gradients', achieving a near-100% pass rate and lower cost for dataset generation. The work was accepted at ACL 2026.

#tool-use

Code calling is all you need

Reddit r/AI_Agents ↗ · 2026-06-16

Argues that using code generated by LLMs to call external tools (code calling) is more efficient and capable than traditional JSON-based function calling, but requires secure sandboxing. The author is building a framework for this approach.
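
The contrast can be sketched as follows. This is a hedged illustration, not the author's framework: `get_price`, `run_json_call`, and `run_code_call` are hypothetical, and the restricted `exec` shown here is NOT a secure sandbox; real code calling needs process- or container-level isolation.

```python
import json

def get_price(item: str) -> float:
    """Hypothetical tool with hard-coded data, for illustration only."""
    prices = {"apple": 0.5, "pear": 0.75}
    return prices[item]

TOOLS = {"get_price": get_price}

def run_json_call(payload: str):
    """JSON function calling: one tool invocation per model message."""
    call = json.loads(payload)
    return TOOLS[call["name"]](**call["arguments"])

def run_code_call(source: str):
    """Code calling: the model emits a snippet that composes tools itself.

    WARNING: limiting builtins like this is not real isolation.
    """
    scope = {"__builtins__": {"sum": sum}, **TOOLS}
    exec(source, scope)
    return scope["result"]

# JSON style: two separate calls, with the host adding up the results.
json_total = sum(
    run_json_call(json.dumps({"name": "get_price", "arguments": {"item": i}}))
    for i in ("apple", "pear")
)

# Code style: one snippet performs both calls and the aggregation.
code_total = run_code_call("result = sum(get_price(i) for i in ('apple', 'pear'))")
```

Both paths compute the same total; the code-calling path needs a single model output instead of one round trip per tool call, which is the efficiency argument, while the `exec` step is where the sandboxing requirement comes from.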

#tool-use

@omarsar0: // OpenClaw-Skill: Searching a Tree of Agent Skills // If you build reusable skill libraries for your agents, this one …

X AI KOLs Following ↗ · 2026-06-16

This paper introduces Collective Skill Tree Search (CSTS), a framework that constructs structured, diverse, and generalizable trees of skills for LLM agents using collective intelligence from multiple models. The resulting model, OpenClaw-Skill, demonstrates improved agentic capabilities in long-horizon planning, tool use, and generalization.

#tool-use

@rohanpaul_ai: The paper is saying that Claude Code works well not because it has a complex AI brain, but because a simple AI loop is …

X AI KOLs Following ↗ · 2026-06-16

A paper analyzing Claude Code reveals that its effectiveness comes from a simple AI loop surrounded by a robust infrastructure for tools, safety, memory, and recovery, rather than a complex AI brain. The study emphasizes that autonomy increases the burden on infrastructure.

#tool-use

Claude Fable 5 distilled

Reddit r/LocalLLaMA ↗ · 2026-06-16

Qwable-v1 is an open-weights agentic coding model (35B MoE, 3B active) built by chaining distills from Claude Opus 4.7 reasoning and Claude Fable-5 agentic tool-use traces. It can think in explicit CoT chains and act as a Claude-Code-style agent when prompted.

#tool-use

Guava: An Effective and Universal Harness for Embodied Manipulation

Hugging Face Daily Papers ↗ · 2026-06-16

Guava is a harness framework for embodied tool use that combines high-level reasoning with external modules, enabling compact models to perform complex manipulation tasks with minimal training data. Experiments show performance comparable to frontier proprietary models.

#tool-use

For tool-using agents, where do you draw the security boundary?

Reddit r/AI_Agents ↗ · 2026-06-14

A discussion on the security risks of AI agents using tools, focusing on prompt injection as a practical threat where untrusted text can alter agent behavior, and the need for repeatable testing before granting permissions.

#tool-use

The Verifier Tax: Horizon-Dependent Safety–Success Tradeoffs in Tool-Using LLM Agents [R]

Reddit r/MachineLearning ↗ · 2026-06-14

This paper presents a safety evaluation framework for tool-using LLM agents and introduces the 'Verifier Tax', a horizon-dependent tradeoff between safety and task completion. It proposes a two-tier verification architecture and uses Tau-bench scenarios to show that verification reduces unsafe successes but also lowers task completion as the task horizon grows.

#tool-use

Strategic Decision Support for AI Agents

arXiv cs.AI ↗ · 2026-06-12

This paper proposes a framework for strategic decision support for AI agents, formulating an optimization problem to minimize support usage while controlling missed-support error. The authors develop an online algorithm and calibration method, demonstrating effectiveness across information gathering, human-AI collaboration, and tool use scenarios.

#tool-use

Claude Fable is relentlessly proactive

Hacker News Top ↗ · 2026-06-12

The article describes how Claude Fable 5, an AI model, demonstrates relentless proactivity by autonomously using browser automation, shell commands, and custom scripts to debug a UI issue, illustrating advanced tool-use capabilities.

#tool-use

From Chatbot to Digital Colleague: The Paradigm Shift Toward Persistent Autonomous AI

Hugging Face Daily Papers ↗ · 2026-06-12

This paper conceptualizes the transition of large language models from conversational chatbots to persistent autonomous AI colleagues, focusing on improved reasoning and tool-augmented task execution with workspace and skill paradigms.

#tool-use

Used Codex's tool use to auto-generate actual .pptx / .docx / .xlsx files — not just content, the real files

Reddit r/AI_Agents ↗ · 2026-06-11

The author describes using OpenAI's Codex model to generate actual Office files (.pptx, .docx, .xlsx) directly via function calls, resulting in a useful end-to-end document generation pattern for AI agents.

#tool-use

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

arXiv cs.LG ↗ · 2026-06-11

This paper introduces IAPO, a reinforcement learning algorithm that improves tool-calling capabilities in multimodal small language models by aligning input attribution with a stronger teacher. Experiments on Qwen2.5-VL-3B show an average 3% improvement in visual question answering accuracy across six test sets.

#tool-use

APPO: Agentic Procedural Policy Optimization

Hugging Face Daily Papers ↗ · 2026-06-10

APPO improves multi-turn tool-use in LLM agents by refining branching decisions and credit assignment using fine-grained decision points and procedure-level advantage scaling, outperforming baselines by 4 points on 13 benchmarks.

#tool-use

STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

arXiv cs.AI ↗ · 2026-06-10

This paper introduces STAGE-Claw, an automated framework for building and evaluating realistic personal-agent scenarios in state-based computing environments, enabling scalable, state-based evaluation of LLM-powered agents.

#tool-use

Releasing Apodex-1.0 Smol Models (0.8B, 2B, 4B Open-Weights) optimized for Agentic Verification + AgentHarness Evals

Reddit r/LocalLLaMA ↗ · 2026-06-10

Apodex releases open-weight small models (0.8B, 2B, 4B) specialized for agentic verification tasks, along with the AgentHarness evaluation framework for local agent workflows.
