research-judgment

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

arXiv cs.AI · yesterday

The paper introduces ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make forward-looking research judgments from historical evidence. The benchmark contains 500 tasks across four AI domains. The results show that explicit evidence organization improves traceability, but they also reveal a recurring evidence-decision decoupling.
