Tag
FutureSim replays chronological world events to benchmark AI agents' long-term predictive abilities, finding that even the best agent achieves only 25% accuracy.