FutureSim: Replaying World Events to Evaluate Adaptive Agents

Hugging Face Daily Papers

Summary

FutureSim replays chronological world events to benchmark AI agents' long-term predictive abilities, finding that even the best agent achieves only 25% accuracy.

AI agents are increasingly being deployed in dynamic, open-ended environments that require adapting to new information as it arrives. To efficiently measure this capability for realistic use cases, we propose building grounded simulations that replay real-world events in the order they occurred. We build FutureSim, where agents forecast world events beyond their knowledge cutoff while interacting with a chronological replay of the world: real news articles arriving and questions resolving over the simulated period. We evaluate frontier agents in their native harnesses, testing their ability to predict world events over a three-month period from January to March 2026. FutureSim reveals a clear separation in their capabilities, with the best agent reaching only 25% accuracy and many having a worse Brier skill score than making no prediction at all. Through careful ablations, we show how FutureSim offers a realistic setting to study emerging research directions like long-horizon test-time adaptation, search, memory, and reasoning about uncertainty. Overall, we hope our benchmark design paves the way to measure AI progress on open-ended adaptation spanning long time horizons in the real world.
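The paper's exact replay protocol and scoring code aren't reproduced here; as a rough sketch of the moving parts the abstract describes (dated articles streamed in order, questions resolving over the simulated period, and a Brier skill score measured against a no-prediction baseline), the evaluation loop might look like the following in Python. All names here (Question, replay, the p = 0.5 reference forecast) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): stream dated news items to an agent,
# collect probability forecasts for questions before they resolve, and score
# them with a Brier skill score against a "no prediction" baseline of p = 0.5.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    qid: str
    text: str
    resolve_date: str   # ISO date on which the outcome becomes known
    outcome: int        # 1 if the event happened, 0 otherwise

def brier_score(forecasts: dict[str, float], questions: list[Question]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((forecasts[q.qid] - q.outcome) ** 2 for q in questions) / len(questions)

def brier_skill_score(forecasts: dict[str, float], questions: list[Question]) -> float:
    """BSS = 1 - BS / BS_ref, using an uninformative p = 0.5 reference forecast."""
    bs = brier_score(forecasts, questions)
    bs_ref = brier_score({q.qid: 0.5 for q in questions}, questions)
    return 1.0 - bs / bs_ref

def replay(events: list[tuple[str, str]],            # (ISO date, article text), sorted by date
           questions: list[Question],
           agent: Callable[[str, list[str]], float]) -> float:
    """Replay articles chronologically; before each question resolves, ask the
    agent for a probability given only the articles published up to that date."""
    forecasts: dict[str, float] = {}
    seen: list[str] = []
    i = 0
    for q in sorted(questions, key=lambda q: q.resolve_date):
        while i < len(events) and events[i][0] < q.resolve_date:
            seen.append(events[i][1])    # agent only ever sees past articles
            i += 1
        forecasts[q.qid] = agent(q.text, seen)
    return brier_skill_score(forecasts, questions)
```

Under this scoring convention, a negative skill score means the agent's probabilities were worse than simply forecasting 0.5 on every question, which is the sense in which the abstract says some agents do worse than making no prediction at all.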
Original Article

Cached at: 05/15/26, 04:24 AM

Source: https://huggingface.co/papers/2605.15188


Similar Articles

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems

arXiv cs.CL

This paper introduces AgentForesight, a framework for online auditing and early failure prediction in LLM-based multi-agent systems. It presents a new dataset, AFTraj-22K, and a specialized model, AgentForesight-7B, which outperforms leading proprietary models in detecting decisive errors during trajectory execution.

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Hugging Face Daily Papers

SkillFlow introduces a benchmark of 166 tasks across 20 families for evaluating autonomous agents' ability to discover, repair, and maintain skills over time through a lifelong learning protocol. Experiments reveal a substantial capability gap among leading models, with Claude Opus 4.6 improving significantly while others show limited or negative gains from skill evolution.