Tag
Rhys Sullivan is building Executor, an open-source integration layer for AI agents that provides a unified tool catalog with access controls, approval flows for destructive actions, and support for MCP, OpenAPI, GraphQL, and more. It aims to standardize tool calling across different agents like Cursor and Claude Code.
BioTool introduces a comprehensive biomedical tool-calling dataset with 34 tools and 7,040 human-verified query-API pairs, enabling fine-tuned LLMs to outperform GPT-5.1 on biomedical tool use and significantly enhance answer quality.
A Stanford professor released a free one-hour lecture covering the fundamentals of AI agents, tool calling, multi-step workflows, planning, and reflection.
IBM released Granite-4.1-8B, an Apache 2.0-licensed, 8B-parameter long-context instruct model with enhanced tool calling and multilingual support.
Moonshot has open-sourced the Kimi K2.6 model, which supports 4,000 tool calls in a single session and 300 parallel sub-agents. It achieves state-of-the-art results on benchmarks such as SWE-Bench Pro, and Moonshot claims performance on par with Claude Opus 4.6 and GPT-5.4.
PolicyBank proposes a memory mechanism that enables LLM agents to autonomously refine their understanding of organizational policies through iterative interaction and corrective feedback, closing specification gaps that cause systematic behavioral divergence from true requirements. The work introduces a systematic testbed and demonstrates PolicyBank can close up to 82% of policy-gap alignment failures, significantly outperforming existing memory mechanisms.
OpenAI announced new tools and features for the Responses API, including support for remote Model Context Protocol (MCP) servers, image generation, Code Interpreter, and improved file search capabilities. The update also enables o3 and o4-mini models to call tools directly within their chain-of-thought, with new enterprise features like background mode and encrypted reasoning items.