long-horizon-tasks

#long-horizon-tasks

@omarsar0: // AutoMem // I quite like this idea of metamemory. (bookmark it) This new research from Stanford treats agent's memory…

X AI KOLs Timeline ↗ · 2d ago Cached

This Stanford research paper introduces AutoMem, a framework that treats agent memory management as a trainable skill. By optimizing memory structure and proficiency separately, AutoMem improves base agent performance 2x-4x on long-horizon tasks, enabling a 32B open-weight model to compete with frontier systems like Claude Opus 4.5 and Gemini 3.1 Pro Thinking.


AutoMem: Automated Learning of Memory as a Cognitive Skill

arXiv cs.AI ↗ · 2d ago Cached

AutoMem introduces a framework that automates learning of memory management as a trainable skill for LLMs, improving performance on long-horizon tasks by 2x-4x through optimizing memory structure and proficiency.


OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks

Hugging Face Daily Papers ↗ · 6d ago Cached

OSWorld 2.0 is a new benchmark for evaluating computer-use agents on 108 long-horizon, real-world workflows. Current agents like Claude Opus 4.8 and GPT-5.5 achieve low completion rates, highlighting significant limitations in handling complex, multi-step tasks.


@astonzhangAZ: GPT-5.6 is a capable model, especially for long-horizon tasks and knowledge work across coding, computer use, and scien…

X AI KOLs Timeline ↗ · 2026-06-26 Cached

GPT-5.6 is a capable model for long-horizon tasks and knowledge work across coding, computer use, and science.


Why self-reflection ReAct loops fail on long-horizon tasks, and the AgentOS verification architecture we built to fix it.

Reddit r/artificial ↗ · 2026-06-21

The post explains why self-reflection ReAct loops fail on long-horizon tasks and introduces the AgentOS verification architecture as a solution.


@jholtdigital: A friend encouraged me recently if I want to know how well something works for my use case, I should try it for a month…

X AI KOLs Following ↗ · 2026-06-20 Cached

A user shares experience using FactoryAI to convert a design system from HTML/CSS to Flutter widgets with E2E testing. The tool employs an orchestrator, workers, and validators using multiple AI models to plan and execute long-horizon tasks over 79 hours, spawning over 229 agents.


Xiaomi's new open source, agentic AI coding harness MiMo Code beats Claude Code at ultra-long, 200+ step tasks (14 minute read)

TLDR AI ↗ · 2026-06-12 Cached

Xiaomi open-sourced MiMo Code, an AI coding assistant with a novel memory architecture that outperforms Claude Code on long-horizon tasks, and includes free access to its MiMo-V2.5 model.


Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

arXiv cs.AI ↗ · 2026-06-11 Cached

This paper presents HORMA, a hierarchical organize-and-retrieve memory agent that organizes agent experiences into a file-system-like structure for efficient retrieval, improving performance on long-horizon tasks while reducing token usage.


@rohanpaul_ai: AI agent can get better at long tasks without retraining the agent itself, by using a separate small model to clean and…

X AI KOLs Following ↗ · 2026-06-08 Cached

AdaCoM is a separate LLM that manages context for a frozen AI agent, improving performance on long tasks without retraining. It improved average web search performance by 39% in tests.


Signal-Driven Observation for Long-Horizon Web Agents

arXiv cs.CL ↗ · 2026-06-08 Cached

The paper proposes Signal-Driven Observation (SDO), a method for web agents to avoid context degradation by only reading task-relevant parts of the DOM and re-invoking observation only when triggered by specific signals, rather than reading the full page state at every action step.


CoMIC: Collaborative Memory and Insights Circulation for Long-Horizon LLM Agents in Cloud-Edge Systems

arXiv cs.AI ↗ · 2026-06-02 Cached

CoMIC is a cloud-edge framework for LLM agents that uses collaborative memory and insight circulation to improve long-horizon task performance without requiring parameter updates, achieving gains in progress rate and action grounding across multiple tasks.


MemPro: Agentic Memory Systems as Evolvable Programs

arXiv cs.CL ↗ · 2026-06-02 Cached

MemPro is a system-level evolution framework that treats the memory construction–retrieval pipeline as an evolvable program, using an Evolving Agent to iteratively diagnose failures and create improved versions. Experiments on long-horizon benchmarks show consistent improvement over static and prompt-level baselines, with a favorable performance–cost trade-off.


@omarsar0: Very good advice on self-improving agents. (bookmark it) This is something I am seeing in my own experiments with codin…

X AI KOLs Following ↗ · 2026-06-01 Cached

The tweet discusses advice on self-improving agents and shares personal observations from experiments with coding agents on long-horizon tasks, noting that stronger models do not always yield better agents.


GTA: Generating Long-Horizon Tasks for Web Agents at Scale

arXiv cs.AI ↗ · 2026-05-29 Cached

This paper introduces GTA, a scalable framework for automatically generating long-horizon, multi-hop web agent tasks with executable trajectories, addressing the lack of process-level supervision in web agent benchmarks. The framework integrates crawling, retrieval-based seeding, and automated quality control to produce realistic tasks across multiple websites.


@wquguru: https://x.com/wquguru/status/2057852569054278045

X AI KOLs Timeline ↗ · 2026-05-22 Cached

The author analyzed the source code of the pi-goal tool and tested it with multiple models, finding that DeepSeek V4 Pro is 31x cheaper than Gemini 3.5 Flash and produces higher-quality results on long-horizon tasks, and that a higher thinking mode increases hallucination.


@0xLogicrw: Zhipu AI founder and chief scientist Tang Jie predicts that the biggest breakthrough in large models this year will be long-horizon tasks, where AI can continuously operate in real environments and solve complex problems. Once long-horizon tasks are achieved, today's 'one-person companies' will rapidly become 'no-employee companies...

X AI KOLs Timeline ↗ · 2026-05-13

Zhipu AI founder Tang Jie predicts that the biggest breakthrough in large models this year will be long-horizon tasks, where AI can continuously solve complex problems in real environments, and mentions three technical pillars and Anthropic's progress in autonomous training.


Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

arXiv cs.CL ↗ · 2026-05-13 Cached

This paper introduces Agent-BRACE, a method that decouples LLM agents into belief state and policy models to handle long-horizon tasks in partially observable environments. By verbalizing state uncertainty, it achieves significant performance improvements over baselines while maintaining constant context window size.


@jietang: Recent thoughts: The Shift to Long-Horizon Tasks The most likely breakthrough this year will be in long-horizon tasks. …

X AI KOLs Timeline ↗ · 2026-05-12

The article discusses the anticipated breakthrough in long-horizon AI tasks and autonomous agents, suggesting a shift from 'one-person' to 'none-person' companies. It highlights technical pillars like memory, continual learning, and self-judging as key to realizing fully self-evolving AI systems that could redefine AGI and operating systems.


ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

arXiv cs.AI ↗ · 2026-05-08 Cached

This paper introduces ReFlect, a training-free harness system that wraps LLMs with deterministic error detection and recovery logic to improve performance on complex, long-horizon reasoning tasks.


Milestone-Guided Policy Learning for Long-Horizon Language Agents

arXiv cs.CL ↗ · 2026-05-08 Cached

This paper introduces BEACON, a milestone-guided policy learning framework designed to improve credit assignment and sample efficiency for long-horizon language agents. It demonstrates significant performance improvements over GRPO and GiGPO on benchmarks like ALFWorld, WebShop, and ScienceWorld.
