A tweet promoting a 38-page PDF guide on building autonomous LLM agents, offering a free resource for learning about agentic AI systems.
This article details how a systematic fund replaced its traditional NLP pipeline with a RAG-based LLM agent architecture, achieving a 340% improvement in alpha generation from unstructured data. It cites recent research (Alpha-GPT 2.0, FinCon, FinAgent) showing significant gains in automated factor discovery and trading performance.
This article reviews the architectural layers of AI agent memory as of mid-2026, including rule files, persistent profiles, historical recall, and evidence chains. It explains how each layer is stored, when it is loaded, and how it is governed, emphasizing the role of memory in letting agents compound their work across sessions.
MEMPROBE is a benchmark that evaluates long-term memory in LLM agents by reconstructing hidden user states from the agent's memory after interaction.
ReM-MoA introduces a memory-augmented Mixture-of-Agents framework that sustains scaling through ranked reasoning memory and curated diversified memory routing, outperforming prior MoA variants across five reasoning benchmarks.
Presents LemonHarness, an integrated execution framework for long-horizon LLM agents that constrains state-changing operations within a clearly defined workspace, introduces a reusable rule knowledge base, and adds time-aware execution. Achieves 84-86% accuracy on Terminal-Bench 2.0.
Metis presents a controlled study comparing text and code memory for self-evolving agents, finding they have complementary trade-offs. It proposes a hierarchical dual-representation memory system that improves task accuracy by up to 20.6% and reduces execution cost by up to 22.8% on the AppWorld benchmark.
The author argues that the reliability of AI agents comes from deterministic code, not the LLM, and shares five key practices for building trustworthy agents on messy real-world data.
This paper proposes the EDV framework, which uses multiple heterogeneous agents in execute-distill-verify stages to build reliable experiences for LLM agents, preventing self-confirmatory errors and improving performance on long-horizon benchmarks.
A blog post summarizing ten recent agentic RL frameworks and best practices, covering modular interfaces, trajectory structure, action masks, process rewards, advantage normalization, scalable rollouts, stability/exploration, and task curriculum.
Explores how different agent architectures yield varying outputs from the same underlying model and prompt, highlighting the impact of agent design on LLM behavior.
This paper investigates whether LLM agents can infer hidden world models through interaction, finding that they struggle to build stable internal models as complexity increases.
This paper introduces representational commitment, a convergence of hidden states across runs that signals when an LLM agent has locked onto a trajectory prematurely. It shows that commitment predicts trajectory consistency but not correctness, and proposes monitoring for it so that a confidently settled agent is not mistaken for a trustworthy one.
CLI-Universe is a synthesis engine that generates verifiable terminal-agent tasks via multi-dimensional capability taxonomy and evidence-guided research, producing a distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on this dataset achieves 33.4% on Terminal-Bench 2.0, setting a new state-of-the-art for open-source models at or below 32B parameters.
Libretto introduces a structured framework for symbolic music generation and revision using an LLM-native grammar and corpus-calibrated statistical evaluation across musical dimensions, enabling LLM agents to treat music as a measurable and editable object.
PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
ScaffoldAgent introduces a utility-guided dynamic outline optimization framework for open-ended deep research, using expansion, contraction, and revision operations to improve long-form report generation and factual grounding.
This paper explores autotelic AI, where agents generate their own goals, and discusses implications for intrinsic motivation, embeddedness, and the dissolution of the self boundary. It proposes a framework that extends to a quantum formulation, non-dual philosophy, and an LLM-based instantiation.
Proposes Multi-Agent Transactive Memory (MATM), a framework for population-level storage and retrieval of agent-generated trajectories to improve task performance and reduce interaction steps in interactive environments like ALFWorld and WebArena.
This paper proposes a human-on-the-loop orchestration framework for AI-assisted legal discovery, introducing a taxonomy of agentic failures and a four-layer verification architecture to reduce privilege-waiver risk.