Cursor AI describes its recursive agent system for scaling training of its Composer model, using a fleet of agents that self-manage and alert humans when issues arise. The system enables parallel experiments and accelerates research, treating researcher time as the scarcest resource.
The paper introduces EurekAgent, an environment-engineered agent system for metric-driven autonomous scientific discovery that achieves state-of-the-art results on math, kernel engineering, and ML tasks with low computational costs.
This paper reviews and audits execution realism in LLM-based trading research, proposing clearer reporting standards for reproducibility and evaluation comparability.
Aquifer is an MCP runtime that provides bounded queues, fairness controls, and dynamic pacing to handle rate limits and traffic spikes in AI agent systems. It also introduces the Aqueduct Protocol for dynamic flow state communication.
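To make "bounded queues" and "fairness controls" concrete, here is a minimal sketch of a per-tenant bounded queue drained in round-robin order. It is not Aquifer's implementation or API: the class `FairBoundedQueue`, its methods, and the `max_per_tenant` parameter are hypothetical names chosen for illustration, and dynamic pacing is left out.

```python
from collections import deque


class FairBoundedQueue:
    """Per-tenant bounded queues drained in round-robin order (illustrative only)."""

    def __init__(self, max_per_tenant: int) -> None:
        self.max_per_tenant = max_per_tenant
        self.queues = {}      # tenant -> deque of pending items
        self.order = deque()  # tenants that currently have queued work

    def offer(self, tenant, item) -> bool:
        """Enqueue item; return False (backpressure) if the tenant's queue is full."""
        q = self.queues.setdefault(tenant, deque())
        if len(q) >= self.max_per_tenant:
            return False
        if not q:
            # Tenant had no pending work, so it joins the round-robin rotation.
            self.order.append(tenant)
        q.append(item)
        return True

    def poll(self):
        """Return the next (tenant, item) in round-robin order, or None if empty."""
        if not self.order:
            return None
        tenant = self.order.popleft()
        q = self.queues[tenant]
        item = q.popleft()
        if q:
            # Still has work: move to the back so other tenants get a turn.
            self.order.append(tenant)
        return tenant, item
```

The bound keeps one noisy caller from exhausting memory, and the rotation keeps that caller from starving the others even when its queue is full.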
When multiple AI agents share an email inbox, they can collide on messages like OTPs, causing silent failures. The solution is dedicated per-agent inboxes with isolated read locks and long-polling instead of scheduled polling.
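The per-agent inbox idea can be sketched in a few lines, under stated assumptions. `InboxRouter` and its methods are hypothetical names, not an API from the article; each inbox is a standard-library `queue.Queue`, which carries its own lock, and the blocking `get` with a timeout stands in for long-polling.

```python
import queue


class InboxRouter:
    """One private inbox per agent, so agents never read each other's messages."""

    def __init__(self) -> None:
        self._inboxes = {}  # agent_id -> queue.Queue

    def register(self, agent_id: str) -> None:
        # Each agent gets its own queue, and therefore its own lock.
        self._inboxes[agent_id] = queue.Queue()

    def deliver(self, agent_id: str, message: str) -> None:
        self._inboxes[agent_id].put(message)

    def wait_for_message(self, agent_id: str, timeout: float):
        """Block until a message arrives or the timeout expires (long-poll style)."""
        try:
            return self._inboxes[agent_id].get(timeout=timeout)
        except queue.Empty:
            return None
```

Because an OTP delivered to one agent's inbox is never visible to another agent, the collision described above cannot occur, and the blocking read returns as soon as the message arrives instead of waiting for the next scheduled poll.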
A practitioner discusses the calibration vs. utility tradeoff in LLM agents, sharing experience with a verifier-based pipeline that reduces hallucinated tool calls by ~60% but introduces latency costs and drops easy correct answers.
BioManus is an MCP-native biomedical agent system that uses graph-scaffolded planning over structured biological capabilities instead of flat prompt-based tool retrieval, achieving better context efficiency and execution accuracy on biomedical benchmarks. The system introduces a BioinfoMCP Compiler to standardize heterogeneous bioinformatics tools and organizes them as a typed heterogeneous MCP graph for scalable reasoning.
An AI agent (COMMS) repeatedly crashes at the shutdown step, revealing a failure mode specific to on-demand agents: the audit trail fails after the work itself has succeeded. The fix involves adjusting the spawn timeout at shutdown, highlighting the need for separate lifecycle checkpoints.
Anthropic published an engineering blog post detailing a multi-agent system that uses Claude Opus 4 as the main orchestrator and Claude Sonnet 4 for the sub-agents. The multi-agent system outperformed a single Claude Opus 4 agent by 90.2%, while token consumption increased by approximately 15x. The post also summarizes five collaboration patterns.
This paper introduces VESTA, a framework that equips vision-language models with dynamically growing toolkits for data exploration and statistical model refinement, outperforming prior agent-based methods on complex scientific modeling tasks. The authors also present Dawn, a benchmark for distribution fitting and time series modeling, including real-world astronomy challenges.
The article discusses the problem of stale context in AI agent systems, where agents make decisions based on outdated information, and proposes a coordination primitive with versioning and presence signals to prevent conflicts and wasted tokens.
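One common way to realize the versioning half of such a primitive is optimistic concurrency: a write succeeds only if the writer read the latest version. The sketch below assumes that approach. `VersionedContext` is a hypothetical name, the article's actual primitive may differ, and presence signals are not modeled.

```python
import threading


class VersionedContext:
    """Shared value guarded by a version counter (optimistic concurrency)."""

    def __init__(self, value) -> None:
        self._lock = threading.Lock()
        self._value = value
        self._version = 0

    def read(self):
        """Return (value, version); the caller keeps the version for its write."""
        with self._lock:
            return self._value, self._version

    def write(self, new_value, expected_version: int) -> bool:
        """Apply the write only if no one else has written since the caller's read."""
        with self._lock:
            if expected_version != self._version:
                return False  # caller acted on stale context; re-read and retry
            self._value = new_value
            self._version += 1
            return True
```

An agent whose write is rejected learns immediately that its context is stale, before it spends more tokens building on an outdated decision.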
HarnessForge proposes a meta-adaptive framework for evolving LLM agent systems by jointly optimizing the execution harness and reasoning policy, achieving consistent improvements on Qwen3 backbones across five benchmarks.
This article discusses an anti-pattern in AI agent systems where agents appear busy but fail to complete tasks. The author suggests separating responsibilities and requiring proof of completion as a solution.
A security vulnerability in Microsoft Copilot Cowork allows attackers to exfiltrate files by exploiting prompt injection that triggers external image requests, potentially leaking pre-authenticated download links.
A new paper formalizes skill optimization for agents by treating markdown skill files as trainable parameters, using bounded edits validated against holdout sets. The approach transfers well between models and improves performance on procedural benchmarks.
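The accept-or-reject step behind "bounded edits validated against holdout sets" might look roughly like the sketch below. This is an assumption-laden illustration, not the paper's algorithm: `changed_lines`, `accept_edit`, the `max_changed_lines` bound, and the `score_on_holdout` callback are all hypothetical, and the edit size is measured with the standard-library `difflib.ndiff`.

```python
import difflib


def changed_lines(old: str, new: str) -> int:
    """Count added plus removed lines between two versions of a skill file."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def accept_edit(old: str, new: str, score_on_holdout, max_changed_lines: int = 5) -> str:
    """Keep the edit only if it is small and strictly improves the holdout score."""
    if changed_lines(old, new) > max_changed_lines:
        return old  # edit exceeds the bound; reject without evaluating
    if score_on_holdout(new) > score_on_holdout(old):
        return new
    return old
```

Bounding the edit size plays a role loosely similar to a step size in gradient-based training, and scoring on held-out tasks guards against edits that only help the examples that motivated them.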
The author recounts building a multi-agent system called Alfred with specialist agents and tools like OpenClaw and H-agent, but after repeated failures, advises starting simple with a single agent to avoid complexity and token waste.
The article describes a self-reviewing AI agent system where a governance review agent caught a breach in another agent, highlighting the system's ability to detect and fix its own issues.
This paper proposes Multi-Stream LLMs, which use multiple parallel input and output streams so that a model can read and generate at the same time, lifting the turn-by-turn limitation of sequential chat formats.
LongMINT is a benchmark for evaluating memory under multi-target interference in long-horizon agent systems.
This paper introduces the concept of the stochastic-deterministic boundary (SDB) for production LLM agents and provides a methodology for selecting architectural patterns to improve reliability and performance.