Tag
Introduces ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make forward-looking research judgments from historical evidence. It contains 500 tasks across four AI domains and shows that explicit evidence organization improves traceability but reveals a recurring evidence-decision decoupling.