This Stanford research paper introduces AutoMem, a framework that treats agent memory management as a trainable skill. By optimizing memory structure and proficiency separately, AutoMem improves base agent performance 2x-4x on long-horizon tasks, enabling a 32B open-weight model to compete with frontier systems like Claude Opus 4.5 and Gemini 3.1 Pro Thinking.
AutoMem is a framework that automates the learning of memory management as a trainable skill for LLMs, improving performance on long-horizon tasks by 2x-4x by optimizing memory structure and proficiency.
OSWorld 2.0 is a new benchmark for evaluating computer-use agents on 108 long-horizon, real-world workflows. Current agents like Claude Opus 4.8 and GPT-5.5 achieve low completion rates, highlighting significant limitations in handling complex, multi-step tasks.
GPT-5.6 is a capable model for long-horizon tasks and knowledge work across coding, computer use, and science.
This piece explains why ReAct loops that rely on self-reflection fail on long-horizon tasks and introduces the AgentOS verification architecture as a solution.
A user shares their experience using FactoryAI to convert a design system from HTML/CSS to Flutter widgets with E2E testing. The tool uses an orchestrator, workers, and validators backed by multiple AI models to plan and execute long-horizon tasks; the run lasted 79 hours and spawned over 229 agents.
Xiaomi has open-sourced MiMo Code, an AI coding assistant with a novel memory architecture that outperforms Claude Code on long-horizon tasks. The release includes free access to Xiaomi's MiMo-V2.5 model.
This paper presents HORMA, a hierarchical organize-and-retrieve memory agent that organizes agent experiences into a file-system-like structure for efficient retrieval, improving performance on long-horizon tasks while reducing token usage.
AdaCoM is a separate LLM that manages context for a frozen AI agent, improving performance on long tasks without retraining. It improved average web search performance by 39% in tests.
The paper proposes Signal-Driven Observation (SDO), a method that helps web agents avoid context degradation. Rather than reading the full page state at every action step, the agent reads only the task-relevant parts of the DOM and re-invokes observation only when specific signals trigger it.
CoMIC is a cloud-edge framework for LLM agents that uses collaborative memory and insight circulation to improve long-horizon task performance without requiring parameter updates, achieving gains in progress rate and action grounding across multiple tasks.
MemPro is a system-level evolution framework that treats the memory construction–retrieval pipeline as an evolvable program, using an Evolving Agent to iteratively diagnose failures and create improved versions. Experiments on long-horizon benchmarks show consistent improvements over static and prompt-level baselines with a favorable performance–cost trade-off.
Tweet discussing advice on self-improving agents, with personal observations from experiments on coding agents for long-horizon tasks, noting that stronger models don't always yield better agents.
This paper introduces GTA, a scalable framework for automatically generating long-horizon, multi-hop web agent tasks with executable trajectories, addressing the lack of process-level supervision in web agent benchmarks. The framework integrates crawling, retrieval-based seeding, and automated quality control to produce realistic tasks across multiple websites.
The author analyzed the source code of the pi-goal tool and tested it with multiple models, finding that DeepSeek V4 Pro is 31x cheaper than Gemini 3.5 Flash and produces higher-quality results on long-horizon tasks, and that a higher thinking mode actually increases hallucination.
Zhipu AI founder Tang Jie predicts that the biggest breakthrough in large models this year will be long-horizon tasks, where AI can continuously solve complex problems in real environments, and mentions three technical pillars and Anthropic's progress in autonomous training.
This paper introduces Agent-BRACE, a method that decouples LLM agents into belief state and policy models to handle long-horizon tasks in partially observable environments. By verbalizing state uncertainty, it achieves significant performance improvements over baselines while maintaining constant context window size.
The article discusses the anticipated breakthrough in long-horizon AI tasks and autonomous agents, suggesting a shift from 'one-person' to 'none-person' companies. It highlights technical pillars like memory, continual learning, and self-judging as key to realizing fully self-evolving AI systems that could redefine AGI and operating systems.
This paper introduces ReFlect, a training-free harness system that wraps LLMs with deterministic error detection and recovery logic to improve performance on complex, long-horizon reasoning tasks.
This paper introduces BEACON, a milestone-guided policy learning framework designed to improve credit assignment and sample efficiency for long-horizon language agents. It demonstrates significant performance improvements over GRPO and GiGPO on benchmarks like ALFWorld, WebShop, and ScienceWorld.