FutureSim: Replaying World Events to Evaluate Adaptive Agents
Summary
FutureSim replays chronological world events to benchmark AI agents' long-term predictive abilities, finding that even the best agent achieves only 25% accuracy.
Source: https://huggingface.co/papers/2605.15188
Abstract
FutureSim enables evaluation of AI agents’ long-term predictive capabilities by simulating chronological real-world event sequences, revealing significant gaps in current forecasting performance.
AI agents are being increasingly deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use-cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harness, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent’s accuracy being 25%, and many having worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time-horizons in the real world.
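The replay-and-score setup the abstract describes can be sketched roughly as below. The Event structure, the agent callback, and the use of a constant 0.5 "no prediction" baseline for the Brier skill score are illustrative assumptions, not FutureSim's actual interface or scoring protocol.

```python
# Minimal sketch of a chronological-replay evaluation loop scored with a
# Brier skill score. Event fields, the agent interface, and the 0.5
# no-prediction baseline are assumptions made for illustration only.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Event:
    timestamp: str          # ISO date at which this item becomes visible to the agent
    article: Optional[str]  # news article text arriving at this point, if any
    question: Optional[str] # forecasting question tied to this point, if any
    outcome: Optional[int]  # 1/0 resolution of the question, if it resolves here


def brier_score(probs: List[float], outcomes: List[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: List[float], outcomes: List[int],
                      baseline_prob: float = 0.5) -> float:
    """BSS = 1 - BS / BS_ref; positive means better than the baseline forecast."""
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([baseline_prob] * len(outcomes), outcomes)
    return 1.0 - bs / bs_ref


def replay(events: List[Event], agent: Callable[[Event], Optional[float]]) -> float:
    """Feed events strictly in chronological order and score resolved questions."""
    probs, outcomes = [], []
    for event in sorted(events, key=lambda e: e.timestamp):
        forecast = agent(event)  # agent may update its memory and/or emit a probability
        if event.outcome is not None and forecast is not None:
            probs.append(forecast)
            outcomes.append(event.outcome)
    return brier_skill_score(probs, outcomes)
```

Under this framing, an agent whose Brier skill score is negative is doing worse than the constant baseline, which is one way to read the abstract's remark that some agents score worse than making no prediction at all.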
Similar Articles
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
This paper introduces AgentForesight, a framework for online auditing and early failure prediction in LLM-based multi-agent systems. It presents a new dataset, AFTraj-22K, and a specialized model, AgentForesight-7B, which outperforms leading proprietary models in detecting decisive errors during trajectory execution.
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World introduces a self-evolving training framework for general agent intelligence that autonomously discovers real-world environments and tasks via the Model Context Protocol, enabling continuous learning. Agent-World-8B and 14B models outperform strong proprietary models across 23 challenging agent benchmarks.
SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.
I've been running production AI agents for months. Anthropic's "dreaming" feature solves the exact failure I kept hitting
Anthropic unveiled 'dreaming' and other updates for Claude Managed Agents, enabling AI agents to learn from past sessions and self-correct, alongside reports of 80x annualized growth.
How are people handling long-term memory + replay/debugging for AI agents?
A developer discusses limitations in current AI agent memory systems and proposes a new memory layer tool with episode storage and replay debugging, seeking community validation.