ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

arXiv cs.AI 06/02/26, 04:00 AM Papers
benchmark llm-agents research-judgment ai-evaluation forward-looking evidence-decoupling
Summary
Introduces ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make forward-looking research judgments from historical evidence. It contains 500 tasks across four AI domains and shows that explicit evidence organization improves traceability but reveals a recurring evidence-decision decoupling.
arXiv:2606.00644v1 (announce type: new)
# ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
Source: [https://arxiv.org/html/2606.00644](https://arxiv.org/html/2606.00644)
Qiuyu Tian¹,², Zequn Liu², Yingce Xia², Youyong Kong¹, Haojie Yin³

¹Southeast University, Nanjing, China; ²Beijing Zhongguancun Academy, Beijing, China; ³Duke Kunshan University, Kunshan, China

###### Abstract

AI research often requires decisions before future evidence exists: which bottleneck to attack, which direction to pursue, or where a project should be positioned\. We introduce ForeSci, a temporally controlled benchmark for evaluating whether LLM agents can make such forward\-looking research judgements from historical evidence\. ForeSci contains 500 tasks across four fast\-moving AI domains and four decision families\. Each task is paired with a cutoff\-aligned offline knowledge base; post\-cutoff papers are hidden during generation and used only for validation\. To avoid random future\-event prediction, tasks are derived from pre\-cutoff taxonomy branches and evidence signals, and answer\-generation backbones are selected to precede the task cutoffs\. We evaluate native LLMs, Hybrid RAG, and three research\-agent adaptations across four backbones\. Results show that explicit evidence organization improves traceability and factual support, but gains depend strongly on the decision family\. Diagnostics reveal a recurring evidence\-decision decoupling: agents may cite relevant evidence while forecasting the wrong research object\. ForeSci turns forward\-looking AI research judgement into a controlled benchmark for evaluating research agents as decision\-making systems\.


## 1 Introduction

AI research moves on a timescale where today's frontier becomes tomorrow's baseline. The value of a research decision (e.g., which bottleneck to attack, which direction is worth a six-month commitment) often lies in anticipating where the field is going. As autonomous research agents are increasingly deployed for ideation, planning, and scientific workflow execution (Lu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib40); Li et al., [2024](https://arxiv.org/html/2606.00644#bib.bib6); Tang et al., [2025](https://arxiv.org/html/2606.00644#bib.bib14); Yamada et al., [2025](https://arxiv.org/html/2606.00644#bib.bib13); Gridach et al., [2025](https://arxiv.org/html/2606.00644#bib.bib12); Chen et al., [2025](https://arxiv.org/html/2606.00644#bib.bib15); Lupidi et al., [2026](https://arxiv.org/html/2606.00644#bib.bib16); Wang et al., [2025](https://arxiv.org/html/2606.00644#bib.bib17)), they are being asked to participate in this forward-looking decision layer. Whether current LLM agents can make defensible, evidence-grounded research judgements about an as-yet-unwritten future is therefore a central open question.

Existing benchmarks do not fully answer this question. Prior work has mostly evaluated whether AI systems can answer questions over papers, synthesize literature (Lála et al., [2023](https://arxiv.org/html/2606.00644#bib.bib5); Wan et al., [2024](https://arxiv.org/html/2606.00644#bib.bib7); Lewis et al., [2020](https://arxiv.org/html/2606.00644#bib.bib3)), use tools (Yao et al., [2023](https://arxiv.org/html/2606.00644#bib.bib1); Schick et al., [2023](https://arxiv.org/html/2606.00644#bib.bib2)), execute research workflows (Chen et al., [2025](https://arxiv.org/html/2606.00644#bib.bib15); Lupidi et al., [2026](https://arxiv.org/html/2606.00644#bib.bib16); Wang et al., [2025](https://arxiv.org/html/2606.00644#bib.bib17)), or generate components of future papers, such as related work, contribution content, citations, and impact (Ajith et al., [2026](https://arxiv.org/html/2606.00644#bib.bib8)). None of these tasks asks whether an agent can produce an open-ended research decision, such as picking a bottleneck, ranking a research agenda, or selecting a venue, using only the evidence available at a specific historical moment.

Building such a benchmark raises two challenges. First, the evidence boundary must be enforceable. Post-cutoff papers should not appear in retrieval or in the backbone's training data. Otherwise, a system may rely on hindsight rather than foresight (Zhao et al., [2024](https://arxiv.org/html/2606.00644#bib.bib19); Ye et al., [2024](https://arxiv.org/html/2606.00644#bib.bib20); Liu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib18); Ajith et al., [2026](https://arxiv.org/html/2606.00644#bib.bib8); Wang et al., [2026](https://arxiv.org/html/2606.00644#bib.bib9)). Second, the tasks must be historically inferable. They should be grounded in signals available before the cutoff, rather than in arbitrary future events or design choices. A foresight benchmark must therefore govern both what a system can see and what it is fair to ask.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/task_examples.png)

Figure 1: Representative ForeSci task examples across the four decision families: direction forecasting, bottleneck–opportunity discovery, strategic research planning, and venue-aware research positioning.

To address these challenges, we introduce ForeSci, a temporally controlled benchmark for forward-looking AI research judgement. It contains 500 tasks across four fast-moving AI domains and four decision families (Figure [1](https://arxiv.org/html/2606.00644#S1.F1)). Each task pairs a public question with a cutoff-aligned offline knowledge base, while post-cutoff evidence is hidden until evaluation. Tasks are constructed from pre-cutoff taxonomy branches, node-level evidence records, and method-evolution signals, ensuring that each decision is historically inferable but not directly answerable from future leakage. Each answer is evaluated by four complementary signals: factual support following the atomic-fact view (Min et al., [2023](https://arxiv.org/html/2606.00644#bib.bib21)), future-target alignment (Wang et al., [2026](https://arxiv.org/html/2606.00644#bib.bib9)), evidence traceability, and reviewer persuasiveness motivated by peer-review reliability analyses (Francois, [2015](https://arxiv.org/html/2606.00644#bib.bib25)). We evaluate a native LLM, Hybrid RAG, and three offline-adapted research-agent systems across four LLM backbones. To avoid data leakage, all systems operate within the same historical knowledge base and all LLM backbones are trained before the time cutoff.

Results show that agent\-style methods improve evidence traceability and factuality, but the strongest method differs by decision family\. A diagnostic audit further reveals an evidence\-decision decoupling: agents can cite relevant pre\-cutoff evidence while forecasting the wrong object, mis\-assigning causal roles, or selecting the wrong intervention\. Beyond retrospective evaluation, we demonstrate that the same construction pipeline supports fully prospective forecasting, enabling continued evaluation of research agents as new literature emerges\. Our key contributions include:

- A temporally controlled benchmark with 500 tasks across four AI domains and four decision families, paired with cutoff-aligned offline knowledge bases and pre-cutoff backbones; the same pipeline supports fully prospective forecasting beyond retrospective evaluation.
- A multi-signal evaluation protocol separating factuality, future-target alignment, evidence traceability, and reviewer persuasiveness, validated against human experts.
- A systematic evaluation and diagnostic audit of LLM research agents showing that agent-style methods improve traceability and factuality task-conditionally, and identifying a previously unstudied failure mode: evidence-decision decoupling.

## 2 Related Work

##### Autonomous Research Agents

AI-for-science systems have moved from local literature QA toward agentic workflows that retrieve, synthesize, ideate, and execute parts of the research loop (Lu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib40); Ghareeb et al., [2026](https://arxiv.org/html/2606.00644#bib.bib41)). PaperQA-style systems (Lála et al., [2023](https://arxiv.org/html/2606.00644#bib.bib5)), Chain-of-Ideas (Li et al., [2024](https://arxiv.org/html/2606.00644#bib.bib6)), AI-Researcher (Tang et al., [2025](https://arxiv.org/html/2606.00644#bib.bib14)), AI Scientist (Yamada et al., [2025](https://arxiv.org/html/2606.00644#bib.bib13)), Intern-Atlas (Wu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib11)), and recent agentic AI-for-science workflows (Gridach et al., [2025](https://arxiv.org/html/2606.00644#bib.bib12)) illustrate this shift toward autonomous research assistance. As these agents are increasingly deployed for ideation and planning, they are implicitly asked to make research decisions. Yet whether they can do so from evidence available at a specific historical moment remains an open question. ForeSci targets this decision layer.

##### Benchmarks for Autonomous Research

Existing benchmarks for autonomous research mainly focus on scientific reasoning (Lu et al., [2022](https://arxiv.org/html/2606.00644#bib.bib47); Center for AI Safety et al., [2026](https://arxiv.org/html/2606.00644#bib.bib48); Bragg et al., [2025](https://arxiv.org/html/2606.00644#bib.bib49); Liu et al., [2025](https://arxiv.org/html/2606.00644#bib.bib50); Jansen et al., [2025](https://arxiv.org/html/2606.00644#bib.bib51)) or on research artifacts: literature-grounded question answering (Wan et al., [2024](https://arxiv.org/html/2606.00644#bib.bib7); Lála et al., [2023](https://arxiv.org/html/2606.00644#bib.bib5)), machine-learning research workflows (Chen et al., [2025](https://arxiv.org/html/2606.00644#bib.bib15); Lupidi et al., [2026](https://arxiv.org/html/2606.00644#bib.bib16)), and paper-based agent arenas (Wang et al., [2025](https://arxiv.org/html/2606.00644#bib.bib17)). These benchmarks measure retrieval, tool use, synthesis, or execution. ForeSci instead asks systems to make prospective research decisions rather than recover accessible answers or execute known workflows. A few recent works begin to evaluate higher-order research capabilities beyond idea generation or workflow execution. Some focus on the novelty (Si et al., [2025](https://arxiv.org/html/2606.00644#bib.bib22); Schopf and Färber, [2026](https://arxiv.org/html/2606.00644#bib.bib23)), taste (Tong et al., [2026](https://arxiv.org/html/2606.00644#bib.bib4)), impact (Jiang, [2026](https://arxiv.org/html/2606.00644#bib.bib24); Zhu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib42)), and future alignment (Wang et al., [2026](https://arxiv.org/html/2606.00644#bib.bib9)) of agent-generated ideas. PreScience (Ajith et al., [2026](https://arxiv.org/html/2606.00644#bib.bib8)) moves further by predicting the components of future papers. These works use future papers or citation signals as evaluation references, a perspective related to ours, but ForeSci focuses on a different research scenario: strategic, forward-looking, macro-level scientific decision-making.

##### Temporal Integrity in Evaluation

Temporal integrity is essential when evaluating foresight: without a strict cutoff, systems can benefit from hindsight, leakage, or later-stabilized terminology rather than inference. ExAnte (Liu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib18)), Set the Clock (Zhao et al., [2024](https://arxiv.org/html/2606.00644#bib.bib19)), ForecastBench (Karger et al., [2025](https://arxiv.org/html/2606.00644#bib.bib44)), FutureX (Zeng et al., [2025](https://arxiv.org/html/2606.00644#bib.bib43)), FOReCAst (Yuan et al., [2026](https://arxiv.org/html/2606.00644#bib.bib45)), PROPHET (Tao et al., [2025](https://arxiv.org/html/2606.00644#bib.bib46)), and MIRAI (Ye et al., [2024](https://arxiv.org/html/2606.00644#bib.bib20)) all motivate time-sliced evaluation for future-oriented reasoning. While these benchmarks mainly evaluate future event prediction in general domains, ForeSci focuses on future-oriented scientific decision-making in fast-moving AI subfields. It therefore extends temporal control to open-ended research-agent outputs, pairing a cutoff-aligned offline knowledge base with hidden post-cutoff supervision.

## 3 The ForeSci Framework

To systematically evaluate *forward-looking AI research judgement*, ForeSci simulates a retrospective forecasting environment. Models are tasked with making research decisions at a strict historical cutoff, using only chronologically aligned evidence.

### 3.1 Problem Formulation

Let $t$ denote a cutoff date, $\mathcal{K}_{\leq t}(q)$ denote the cutoff-aligned knowledge base constructed for question $q$ (i.e., literature published up to $t$), and $\mathcal{G}_{>t}(q)$ denote the withheld validation targets derived from post-cutoff literature. A benchmark instance is

$$x = (q, t, \mathcal{K}_{\leq t}(q), f), \tag{1}$$

where $f$ is the required task family. A system returns $a = \pi_{\theta}(q, \mathcal{K}_{\leq t}(q))$ using only the provided cutoff-aligned knowledge base; $\mathcal{G}_{>t}(q)$ is accessible only to evaluation. To avoid information leakage, we use answer-generation backbones trained before the relevant task cutoffs, disable web search, and allow systems to use only $\mathcal{K}_{\leq t}(q)$ as external support when producing answers.
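As an illustration of Eq. (1), the instance tuple and its evidence boundary can be sketched as a small data structure. This is a minimal sketch under stated assumptions, not the released ForeSci format: the class names `Paper` and `Instance`, their fields, and the `answer` helper are hypothetical, and the only behavior taken from the text is that the knowledge base may contain nothing published after $t$ and that the policy sees only $q$ and $\mathcal{K}_{\leq t}(q)$.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Tuple


@dataclass(frozen=True)
class Paper:
    # Hypothetical minimal paper record; field names are illustrative.
    paper_id: str
    published: date
    text: str


@dataclass(frozen=True)
class Instance:
    question: str                      # q
    cutoff: date                       # t
    knowledge_base: Tuple[Paper, ...]  # K_{<=t}(q)
    family: str                        # f

    def __post_init__(self):
        # Enforce the evidence boundary: nothing published after t is visible.
        leaked = [p.paper_id for p in self.knowledge_base if p.published > self.cutoff]
        if leaked:
            raise ValueError(f"post-cutoff papers in knowledge base: {leaked}")


def answer(instance: Instance, policy: Callable[[str, Tuple[Paper, ...]], str]) -> str:
    # a = pi_theta(q, K_{<=t}(q)): the policy never receives the hidden targets G_{>t}(q).
    return policy(instance.question, instance.knowledge_base)
```

Keeping the hidden targets out of the `Instance` type altogether is one way to make the "accessible only to evaluation" constraint structural rather than a convention.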

ForeSci instantiates this judgement problem through four task families: Direction Forecasting, Bottleneck–Opportunity Discovery, Strategic Research Planning, and Venue-Conditioned Positioning. Each family asks for a different research decision after $t$: predicting a concrete technical trajectory, identifying a bottleneck and the opportunity it unlocks, ranking candidate research directions under planning constraints, or positioning a project for an appropriate venue community.

### 3.2 Data Collection and Filtering

Figure [2](https://arxiv.org/html/2606.00644#S3.F2) summarizes the construction pipeline. ForeSci is built from four rapidly evolving AI research areas: LLM agents, LLM fine-tuning and post-training, RAG and retrieval structuring, and visual generative modeling. For each area, we harvest candidate papers from arXiv ([https://arxiv.org/](https://arxiv.org/)) using domain-specific queries, enrich publication metadata with Semantic Scholar ([https://www.semanticscholar.org/](https://www.semanticscholar.org/)), deduplicate arXiv identifiers, and retain core/support papers after relevance and benchmark-core screening.

We apply two filtering stages to construct cutoff-aligned corpora. First, a domain-relevance screen removes papers that only match surface keywords. Second, a stricter benchmark-core screen identifies representative papers with central domain contributions and future-facing signals (e.g., novel evaluation protocols, identified bottlenecks). Relevant but less central papers are retained as support papers, while noisy or borderline cases are excluded. Finally, the processed corpus is chronologically truncated at the cutoff time $t$ to form the public pre-cutoff knowledge base $\mathcal{K}_{\leq t}$. The specific cutoff date $t$ varies across task instances, encompassing three-month (December 31, 2025), six-month (September 30, 2025), and venue-specific deadline settings after September 30, 2025. Domain-level statistics are reported in Table [1](https://arxiv.org/html/2606.00644#S3.T1); horizon details and paper-count statistics are provided in Table [A1](https://arxiv.org/html/2606.00644#A2.T1) and Figure [A1](https://arxiv.org/html/2606.00644#A2.F1). Additional construction details are provided in Appendix [B](https://arxiv.org/html/2606.00644#A2).
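The deduplication, screening, and chronological truncation steps described above can be sketched as follows. This is an illustrative sketch only: the dictionary keys (`arxiv_id`, `published`, `role`) and the role labels `core`, `support`, and `excluded` are assumptions introduced here, and the screening decision itself (made upstream in the paper's pipeline) is taken as given.

```python
from datetime import date

# Hypothetical screening labels: papers marked "excluded" are dropped.
KEPT_ROLES = {"core", "support"}


def build_knowledge_base(papers, cutoff):
    """Return screened papers published on or before `cutoff`, one per arXiv id."""
    kept = {}
    for paper in papers:
        if paper["role"] not in KEPT_ROLES:
            continue  # noisy or borderline case
        if paper["published"] > cutoff:
            continue  # post-cutoff evidence stays hidden
        kept.setdefault(paper["arxiv_id"], paper)  # deduplicate identifiers
    # Oldest first, with the identifier as a deterministic tie-breaker.
    return sorted(kept.values(), key=lambda p: (p["published"], p["arxiv_id"]))
```

For example, with a cutoff of `date(2025, 9, 30)`, a paper dated September 30, 2025 is retained while one dated October 1, 2025 is withheld.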

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/benchmark_construction.png)

Figure 2: Construction process for the current formal ForeSci release. The figure shows the pipeline from corpus harvest and screening to temporal taxonomy induction, evidence and evolution asset construction, task-family builders, hidden validation targets, and the final benchmark release with public tasks and a paper knowledge base.

| Domain | KB Documents | Tasks |
|---|---|---|
| LLM Agents | 2,769 | 138 |
| LLM Fine-tuning and Post-training | 2,131 | 99 |
| RAG and Retrieval Structuring | 767 | 92 |
| Visual Generative Modeling and Diffusion | 913 | 171 |

Table 1: Domain-level statistics in ForeSci. KB document counts come from the cutoff-aligned offline knowledge base; task counts come from the curated benchmark release.

### 3.3 Taxonomy Construction

To make the foresight problem both inferable and traceable, ForeSci models the evolution of AI research through taxonomy induction. This allows us to find specific research subdirections whose trajectories can be systematically deduced along the taxonomy and strictly grounded in historical evidence. We build on TaxoAdapt (Kargupta et al., [2025](https://arxiv.org/html/2606.00644#bib.bib10)) to induce this taxonomy as a graph representation of the evolving research landscape. For each domain $d$ and cutoff $t$, we induce a temporal taxonomy

$$\mathcal{T}_{d,t} = (\mathcal{V}_{d,t}, \mathcal{E}_{d,t}), \tag{2}$$

where nodes represent research subdirections and edges represent method-evolution relations (see Figure [A3](https://arxiv.org/html/2606.00644#A2.F3) for illustrative examples). The taxonomy is dynamically expanded across sequential time slices of the cutoff-aligned corpus, preserving temporal causality to prevent future information leakage.

##### Node representation\.

Each node $v \in \mathcal{V}_{d,t}$ is aggregated from multiple cutoff-visible papers. For each node $v$, we construct a *node evidence record* that links the subdirection back to the cutoff-visible literature. The record mainly includes the representative papers and supporting papers. Each paper carries a *full-text evidence* entry showing what problems, methods, evaluation focus, limitations, and contribution types had already appeared before $t$.
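One way to picture the temporal taxonomy of Eq. (2) and its node evidence records is as a small graph container. The sketch below is a hypothetical data layout, not TaxoAdapt's or ForeSci's actual schema: the class and field names (`NodeEvidenceRecord`, `TemporalTaxonomy`, `first_seen`, and so on) are assumptions, and the LLM-driven induction step is not modeled.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class NodeEvidenceRecord:
    # Hypothetical record linking a subdirection to cutoff-visible papers.
    node_id: str
    label: str
    first_seen: date
    representative_papers: list = field(default_factory=list)
    supporting_papers: list = field(default_factory=list)


@dataclass
class TemporalTaxonomy:
    domain: str                                # d
    cutoff: date                               # t
    nodes: dict = field(default_factory=dict)  # V_{d,t}: node_id -> record
    edges: set = field(default_factory=set)    # E_{d,t}: (source, target) pairs

    def add_node(self, record: NodeEvidenceRecord) -> None:
        # Temporal causality: a node may only rest on cutoff-visible evidence.
        if record.first_seen > self.cutoff:
            raise ValueError(f"node {record.node_id} appears after the cutoff")
        self.nodes[record.node_id] = record

    def add_evolution(self, source: str, target: str) -> None:
        # A method-evolution edge between two existing subdirections.
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must already be in the taxonomy")
        self.edges.add((source, target))

    def successors(self, node_id: str) -> list:
        return sorted(dst for src, dst in self.edges if src == node_id)
```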

##### High\-order signal extraction\.

From these node evidence records, we derive high\-order signals for downstream task construction:

(1) *candidate directions*, which group one or more related nodes into coherent research options;

(2) *method-development signals* (Wu et al., [2026](https://arxiv.org/html/2606.00644#bib.bib11)), which record how methods, evaluations, or bottlenecks evolve over time (see Figure [A4](https://arxiv.org/html/2606.00644#A2.F4) for an example);

(3) *bottleneck signals*, which summarize recurring limitations, evaluation gaps, reliability or safety concerns, dataset or benchmark needs, and technical risks;

(4) *feasibility, dependency, and risk notes*, which record whether a candidate direction is actionable as a near-term research plan;

(5) *venue-community metadata*, which summarize publication and community context, including contribution style, maturity expectations, reviewer risks, and nearby venue contrasts.

All taxonomy structures are first extracted by an LLM from cutoff-aligned evidence and then checked by human experts, who verify support strength and temporal validity.

### 3.4 Task Families

We derive four task families from the taxonomy\. Task instances are constructed through a human–LLM collaborative process: an LLM first drafts candidate questions, options, and answers from the taxonomy\-derived evidence records\. Human experts then inspect the source evidence, check cutoff validity and leakage risk, revise unclear or weakly grounded items, and approve each final instance, ensuring that the benchmark instances reflect expert\-validated foresight challenges rather than merely the taxonomy’s structure\.

##### Direction Forecasting\.

This family asks the system to choose, from a fixed set of *candidate directions*, which direction is most likely to gain momentum in the post-cutoff window. The task $q$ is grounded in *node evidence records* and *method-development signals* before $t$. The hidden future validation targets $\mathcal{G}_{>t}(q)$ are candidate directions (including primary directions and acceptable neighbors) with trajectory labels (e.g., accelerating, steady) induced by the *method-development signals* after the cutoff, combined with the necessary evidence before the cutoff.

##### Bottleneck–Opportunity Discovery\.

This family asks the system to identify one root bottleneck in a cutoff-visible research subdirection and explain what one-hop opportunity would open if that bottleneck were reduced. The task $q$ is grounded in *bottleneck signals*, *full-text evidence*, and *method-development signals*. The hidden future validation targets $\mathcal{G}_{>t}(q)$ are bottleneck–opportunity pairs induced from *bottleneck signals* before $t$ and *method-development signals* after $t$, including primary bottlenecks, acceptable bottleneck variants, unlocked opportunities, and mechanism descriptions, combined with the necessary evidence before the cutoff.

##### Strategic Research Planning\.

This family asks the system to rank a fixed set of research options for a hypothetical team making a near-term research plan at the cutoff. The task $q$ is derived from *candidate directions* together with the *node evidence records*, *method-development signals*, *bottleneck signals*, and *feasibility, dependency, and risk notes* available before $t$. The hidden future validation targets $\mathcal{G}_{>t}(q)$ are ranked *candidate directions*, including the preferred ordering, top-priority option, rationale units, milestones, dependencies, risks, and go/no-go criteria induced from post-cutoff *method-development signals* and *bottleneck signals*, combined with the necessary evidence and *feasibility, dependency, and risk notes* before the cutoff.

##### Venue\-Conditioned Positioning\.

This family asks the system to position a proposed contribution for a target venue cycle. Given a project description and a fixed set of venue or track options, the system must rank or conditionally recommend venue families, explain the appropriate framing, identify reviewer risks, and specify what evidence upgrades would make the project credible for the target venue community. The task uses contribution types from *full-text evidence* together with *venue-community metadata*. The hidden future validation targets $\mathcal{G}_{>t}(q)$ are venue-positioning decisions induced from *venue-community metadata* and post-cutoff *method-development signals* that reflect community expectations, combined with the necessary evidence before the cutoff.

Across all families, public questions do not expose internal taxonomy information or post-cutoff outcomes. The formal release contains 125 tasks for each family. Additional details on the benchmark construction are provided in Appendix [B](https://arxiv.org/html/2606.00644#A2).

## 4 Evaluation

### 4.1 Metrics

For each public question $q$, system answer $a$, pre-cutoff support packet $\mathcal{E}_{\leq t}(q)$, and hidden future validation targets $\mathcal{G}_{>t}(q)$, we report four complementary metrics. These metrics are designed to assess whether the answer states correct future facts, reaches a conclusion consistent with the future target, grounds its reasoning in visible pre-cutoff evidence, and presents a judgment persuasive to a virtual reviewer.

##### Prediction Factuality \(*Fact*\)\.

This metric evaluates whether the answer makes claims that are supported by $\mathcal{G}_{>t}(q)$. Following the atomic-fact view of FACTSCORE (Min et al., [2023](https://arxiv.org/html/2606.00644#bib.bib21)), we extract atomic claims $\mathcal{C}(a)$ from the answer. We also define a hidden claim bank $\mathcal{C}^{*}(q) \subset \mathcal{G}_{>t}(q)$: a set of task-relevant atomic validation claims derived from hidden future validation targets. Prediction Factuality is the claim-level F1 between $\mathcal{C}(a)$ and $\mathcal{C}^{*}(q)$.
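A claim-level F1 of this kind can be sketched as below. The sketch assumes a particular reading: precision is the share of answer claims supported by at least one bank claim, and recall is the share of bank claims covered by at least one answer claim. The `matches` callback stands in for whatever claim-matching judgment the evaluator makes; the default `exact_match` (case- and whitespace-insensitive string equality) is a deliberately naive placeholder, not the paper's matcher.

```python
def exact_match(claim: str, reference: str) -> bool:
    # Placeholder matcher: normalized string equality (an assumption).
    return " ".join(claim.lower().split()) == " ".join(reference.lower().split())


def claim_f1(answer_claims, bank_claims, matches=exact_match) -> float:
    """Claim-level F1 between extracted answer claims and a hidden claim bank."""
    if not answer_claims or not bank_claims:
        return 0.0
    # Precision: answer claims supported by some claim in the bank.
    supported = sum(any(matches(c, g) for g in bank_claims) for c in answer_claims)
    # Recall: bank claims covered by some claim in the answer.
    covered = sum(any(matches(c, g) for c in answer_claims) for g in bank_claims)
    precision = supported / len(answer_claims)
    recall = covered / len(bank_claims)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With three answer claims of which two are supported, and a four-claim bank of which two are covered, precision is 2/3, recall is 1/2, and the F1 is 4/7.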

##### Future\-Target Alignment \(*FTA*\)\.

This metric evaluates whether the answer aligns with the task-family-specific future target in $\mathcal{G}_{>t}(q)$. For Direction Forecasting and Bottleneck–Opportunity Discovery, it compares extracted prediction claims with the hidden claim bank using bge-m3 similarity. For Strategic Research Planning and Venue-Conditioned Positioning, where the target is an ordered decision, it computes deterministic ranking alignment against the hidden preferred ranking.
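The text does not spell out the aggregation or the ranking statistic, so the following is only a sketch of one plausible instantiation. It assumes that claim embeddings (for example from bge-m3) have already been computed and are passed in as plain vectors, that similarity alignment averages each predicted claim's best cosine match in the bank, and that ranking alignment is the fraction of option pairs ordered as in the preferred ranking (Kendall's tau rescaled to $[0,1]$ when there are no ties). All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_alignment(pred_vecs, bank_vecs) -> float:
    """Mean over predicted claims of the best cosine match in the claim bank."""
    if not pred_vecs or not bank_vecs:
        return 0.0
    return sum(max(cosine(p, g) for g in bank_vecs) for p in pred_vecs) / len(pred_vecs)


def ranking_alignment(predicted, preferred) -> float:
    """Fraction of option pairs that `predicted` orders the same way as `preferred`.

    Assumes both rankings list the same options exactly once.
    """
    position = {option: i for i, option in enumerate(predicted)}
    pairs = list(combinations(preferred, 2))  # (a, b) with a preferred over b
    if not pairs:
        return 1.0
    agree = sum(position[a] < position[b] for a, b in pairs)
    return agree / len(pairs)
```

Under this choice, swapping the last two of three options keeps two of the three pairwise orderings, giving an alignment of 2/3; a fully reversed ranking scores 0.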

##### Evidence Traceability Score \(*Trace*\)\.

This metric evaluates whether the answer can be traced to the pre-cutoff support packet $\mathcal{E}_{\leq t}(q)$. The evaluator scores whether the answer uses relevant pre-cutoff evidence, whether that evidence supports the stated decision, and whether the reasoning avoids unsupported jumps from the available literature. Evidence Traceability Score is reported as a normalized rubric score in $[0,1]$.

##### Reviewer Persuasiveness \(*Pers*\)\.

This metric evaluates whether the answer presents a strong research judgment persuasive to an LLM-based virtual reviewer. For each task family $f$, a rubric $\mathcal{R}_f$ scores $(q, a, \mathcal{E}_{\leq t}(q), \mathcal{G}_{>t}(q))$ on task-specific decision quality, mechanistic reasoning, comparative reasoning, clarity, and risk awareness:

$$\mathrm{Pers.}(a, q) = \mathcal{R}_f(q, a, \mathcal{E}_{\leq t}(q), \mathcal{G}_{>t}(q)).$$

Automatic evaluation uses DeepSeek-V4 as the evaluator on the 500-task formal release. Appendix [C.3](https://arxiv.org/html/2606.00644#A3.SS3) reports human validation for the automatic metrics. Appendix [C](https://arxiv.org/html/2606.00644#A3) gives family-specific prompts and scoring rules. For rubric-style metrics, we repeat evaluator runs and report the mean score with variance. This applies to Evidence Traceability Score (*Trace*) and Reviewer Persuasiveness (*Pers.*), where repeated scoring makes evaluator uncertainty visible.
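The rubric normalization and repeated-run reporting can be illustrated with a short sketch. The rubric layout is an assumption made for illustration: each criterion is taken to have a fixed maximum number of points, and the normalized score is the earned share of the total. The use of the sample variance (rather than the population variance) is likewise a choice made here, not something the text specifies.

```python
from statistics import mean, variance


def normalized_rubric_score(points: dict, max_points: dict) -> float:
    """Map per-criterion rubric points to a single score in [0, 1]."""
    total_max = sum(max_points.values())
    return sum(points[criterion] for criterion in max_points) / total_max


def summarize_runs(run_scores) -> tuple:
    """Mean and sample variance of one answer's score over repeated evaluator runs."""
    spread = variance(run_scores) if len(run_scores) > 1 else 0.0
    return mean(run_scores), spread
```

Reporting the spread alongside the mean is what makes evaluator disagreement visible: two answers with the same mean score can differ markedly in how stable that score is across runs.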

### 4.2 Models, Systems, and Adaptation

We evaluate five systems: Native LLM without retrieval, Hybrid RAG with sparse+dense retrieval, and three offline-adapted agentic systems: CoI-style, ResearchAgent-style, and ARIS-style. We adapt the agentic systems to ForeSci by constraining retrieval, tool use, and memory to the offline knowledge base and by rendering final answers through task-family-specific output schemas. Detailed adaptation notes are in Appendix [D](https://arxiv.org/html/2606.00644#A4). We evaluate four backbones: Qwen3-235B (released April 29, 2025; Qwen Team, [2025](https://arxiv.org/html/2606.00644#bib.bib36)), GPT-5.2 (knowledge cutoff August 31, 2025; OpenAI, [2025](https://arxiv.org/html/2606.00644#bib.bib37)), GLM-4.6 (released September 30, 2025; Z.AI, [2025](https://arxiv.org/html/2606.00644#bib.bib38)), and Gemini-3 (knowledge cutoff January 2025; Google, [2025](https://arxiv.org/html/2606.00644#bib.bib39)). All four were trained before the cutoff time, which avoids data leakage.

## 5 Results

### 5.1 Evaluation of LLM Agents

| Method | Qwen3-235B Fact | Qwen3-235B FTA | Qwen3-235B Trace | Qwen3-235B Pers | GPT-5.2 Fact | GPT-5.2 FTA | GPT-5.2 Trace | GPT-5.2 Pers | GLM-4.6 Fact | GLM-4.6 FTA | GLM-4.6 Trace | GLM-4.6 Pers | Gemini-3 Fact | Gemini-3 FTA | Gemini-3 Trace | Gemini-3 Pers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Native LLM | 0.603 | 0.622 | – | 0.786 | 0.618 | 0.628 | – | 0.846 | 0.509 | 0.590 | – | **0.674** | 0.544 | **0.609** | – | **0.741** |
| Hybrid RAG | 0.597 | 0.630 | 0.432 | 0.775 | 0.610 | 0.626 | 0.408 | 0.837 | 0.520 | 0.613 | 0.429 | 0.658 | 0.559 | 0.598 | 0.413 | 0.720 |
| CoI | **0.611** | 0.642 | 0.560 | 0.782 | 0.626 | 0.632 | 0.593 | 0.857 | **0.543** | 0.620 | 0.499 | 0.662 | **0.563** | 0.606 | 0.473 | 0.734 |
| ResearchAgent | 0.609 | **0.660** | 0.563 | 0.787 | **0.635** | 0.633 | 0.584 | 0.857 | 0.540 | **0.633** | 0.499 | 0.656 | 0.562 | **0.609** | 0.459 | 0.729 |
| ARIS | 0.607 | 0.644 | **0.608** | **0.793** | 0.617 | **0.642** | **0.627** | **0.861** | 0.537 | 0.619 | **0.520** | 0.649 | 0.560 | 0.602 | **0.567** | 0.733 |

Table 2: Overall results on ForeSci. Bold marks the best method within the same backbone and metric.

Table [2](https://arxiv.org/html/2606.00644#S5.T2) reports Prediction Factuality (Fact), Future-Target Alignment (FTA), Evidence Traceability Score (Trace), and Reviewer Persuasiveness (Pers) across four backbones and five methods. Appendix [E](https://arxiv.org/html/2606.00644#A5) gives five-run evaluator stability views (Table [A8](https://arxiv.org/html/2606.00644#A5.T8)).

Agent-style methods generally improve evidence-grounded metrics. Across backbones, the strongest agent is competitive with or better than Native LLM and Hybrid RAG on Fact and FTA, and all three agents consistently improve Trace over Hybrid RAG. This suggests that agentic workflows can better align answers with future validation targets while exposing pre-cutoff grounding more explicitly. These gains do not consistently improve Reviewer Persuasiveness. One explanation is that backbones use retrieved or structured artifacts differently: for some, they support reasoning; for others, they add noise behind a coherent final justification, lowering the quality of the judgment report. Method rankings also vary by task family (Table [A7](https://arxiv.org/html/2606.00644#A5.T7)). No agent is uniformly strongest across metrics, backbones, and task families, and in some settings agentic methods show no clear advantage over the native backbone. Additional retrieval and tool use therefore do not automatically translate into better foresight performance, motivating the error analysis below.

### 5.2 Error Mechanisms: When Foresight Fails

##### Family\-dependent failures

We further conduct an internal error analysis to examine how LLM agents fail in foresight tasks. We first identify low-scoring cases for each metric using the bottom 20% of rows as the low-score threshold, and then compute the fraction of low-score cases within each task family. Figure [3](https://arxiv.org/html/2606.00644#S5.F3)(a) shows that failures are strongly family-dependent. For example, Strategic Planning has the highest low-score rates on Fact and FTA, reflecting the difficulty of matching both the ranked decision and its supporting facts. Together, these patterns motivate the use of multiple evaluation signals, as Fact, FTA, Trace, and Persuasiveness reveal distinct failure channels that a single aggregate score would obscure.
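The bottom-20% procedure can be sketched as follows. The row layout (dictionaries with a `family` key and one key per metric) is an assumption, as is the handling of ties: rows whose score equals the threshold are counted as low-scoring, so the low-score set can slightly exceed 20% when scores repeat.

```python
from collections import defaultdict


def low_score_rates(rows, metric, fraction=0.2):
    """Per-family share of rows falling in the bottom `fraction` for `metric`."""
    ranked = sorted(rows, key=lambda r: r[metric])
    k = max(1, int(len(ranked) * fraction))
    threshold = ranked[k - 1][metric]  # score of the last row inside the bottom slice

    totals, lows = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row["family"]] += 1
        if row[metric] <= threshold:  # ties at the threshold count as low
            lows[row["family"]] += 1
    return {family: lows[family] / totals[family] for family in totals}
```

Because the threshold is computed over all rows while the rate is computed within each family, a family with a rate well above 0.2 is over-represented among failures for that metric.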

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/main_heatmap.png) Figure 3: Low\-score channels and drift\-induced effects\. \(a\) Bottom\-20% low\-score rates across evaluation metrics and task families\. Each cell reports the fraction of examples in a task family that fall into the low\-score set for a given metric\. \(b\) Normalized metric drop caused by evidence\-to\-decision drift, computed by comparing cases with no drift severity against cases with severe drift\. \(c\) Drift severity among high\-traceability answers\. High\-traceability but low\-FTA answers exhibit substantially higher drift severity across all drift types\.

##### Evidence\-to\-decision drift

We then analyze evidence\-to\-decision drift by comparing model answers with reference answers\. We use LLM\-based classification with human expert verification to identify four common types of answer drift: \(1\) *Scope/granularity drift* occurs when the answer discusses a related research direction but at the wrong level of specificity\. \(2\) *Causal\-role drift* occurs when the answer assigns the wrong role to a technical factor, such as treating an enabled opportunity as the root bottleneck\. \(3\) *Intervention\-mode drift* occurs when the answer targets the right general issue but recommends the wrong type of intervention, such as proposing system integration improvements when the reference calls for a change in the training objective\. \(4\) *Temporal\-horizon drift* occurs when the answer targets the wrong maturity stage, such as jumping from a near\-term opportunity to a much longer\-term vision\. Each drift type is annotated with a severity score in $[0, 3]$, where 0 indicates no drift and 3 indicates severe drift\. We sample 20 tasks per family and include all five methods and four backbones, yielding 1600 matched answers \(80 tasks × 20 method\-backbone combinations\); annotating each answer on all four drift types gives 6400 dimension\-level annotations\.

To quantify the metric impact of each drift type, we compute a normalized effect size:

$$\Delta_{\mathrm{norm}}(m) = \frac{\mathbb{E}[m \mid s = 0] - \mathbb{E}[m \mid s \geq 2]}{\mathrm{SD}(m)},$$

where $m$ is the target metric and $s$ is the annotated drift severity\. Figure [3](https://arxiv.org/html/2606.00644#S5.F3)\(b\) shows that severe drift substantially reduces the content\-facing metrics\. Causal\-role drift lowers Fact by 1\.13 standard deviations, while scope/granularity and intervention\-mode drift lower FTA by 1\.22 and 1\.12 standard deviations, respectively\. Persuasiveness also declines under severe drift, but less uniformly\. In contrast, Trace is much more weakly coupled to these content drifts and its direction depends on the drift type\.
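The normalized effect size can be computed with a few lines of standard-library Python. The sketch below is illustrative only: the function name is hypothetical, and it assumes SD(m) is the population standard deviation of the metric over all annotated answers, which the text does not specify.

```python
from statistics import mean, pstdev


def normalized_drop(scores, severities):
    """(mean score at severity 0 - mean score at severity >= 2) / SD of all scores.

    `scores` and `severities` are parallel sequences for one metric and one drift type.
    Assumption: SD is the population standard deviation over every annotated answer,
    including answers with severity 1, which enter neither group mean.
    """
    no_drift = [m for m, s in zip(scores, severities) if s == 0]
    severe = [m for m, s in zip(scores, severities) if s >= 2]
    if not no_drift or not severe:
        raise ValueError("need at least one no-drift and one severe-drift answer")
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("metric has zero variance; effect size is undefined")
    return (mean(no_drift) - mean(severe)) / sd
```

A positive value means the metric is lower under severe drift; for instance, a value of 1.13 corresponds to a drop of 1.13 standard deviations.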

##### High traceability but high drift

We therefore further inspect high\-Trace cases in Figure [3](https://arxiv.org/html/2606.00644#S5.F3)\(c\)\. Among answers with high Trace, the low\-FTA subset has much higher drift severity across all four drift types than the non\-low\-FTA subset\. This confirms that an answer can be well supported by local evidence while still selecting the wrong decision object, causal role, intervention type, or time horizon\. A compact case in Appendix Figure [A6](https://arxiv.org/html/2606.00644#A6.F6) illustrates the distinction\. A Gemini\-3 ARIS answer for a venue positioning task has high traceability \(0\.920\) but low Prediction Factuality \(0\.200\) and low FTA \(0\.355\): it gives a plausible NeurIPS\-first framing for a reinforcement\-learning\-from\-AI\-feedback contribution, but this framing is less aligned with the task’s reference target, which prioritizes ACL/EMNLP because the work is framed as language\-model post\-training and alignment\.
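The subset comparison above can be sketched as follows. The thresholds `trace_min` and `fta_low`, the record layout, and the drift-type keys are assumptions for illustration; the text does not state the cutoffs used to define "high Trace" or "low FTA" in this analysis.

```python
from statistics import mean

# Hypothetical keys for the four drift types named in the text.
DRIFT_TYPES = ("scope_granularity", "causal_role", "intervention_mode", "temporal_horizon")


def drift_by_fta_group(answers, trace_min, fta_low):
    """Mean drift severity per drift type among high-Trace answers, split by FTA.

    Each answer is a dict with "trace", "fta", and a "drift" dict mapping every
    drift type to a severity in [0, 3]. Thresholds are caller-supplied assumptions.
    """
    high_trace = [a for a in answers if a["trace"] >= trace_min]
    groups = {
        "low_fta": [a for a in high_trace if a["fta"] <= fta_low],
        "non_low_fta": [a for a in high_trace if a["fta"] > fta_low],
    }
    # An empty group yields None rather than a misleading zero.
    return {
        name: ({d: mean(a["drift"][d] for a in members) for d in DRIFT_TYPES}
               if members else None)
        for name, members in groups.items()
    }
```

A gap between the two groups on every drift type is the pattern that indicates decoupling between traceability and decision alignment.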

##### Method fingerprints

We also examine the content\-level failure patterns of different methods and find that each method has a distinct diagnostic fingerprint, summarized in Appendix [F\.1](https://arxiv.org/html/2606.00644#A6.SS1)\. Overall, these results suggest that agentic evidence organization should be used with caution: while agents can improve traceability, they may also over\-amplify locally supported but decision\-misaligned evidence, thereby steering the model toward a confidently grounded yet incorrect research judgment\.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/question_example_main.png) Figure 4: Prospective forecasting showcase for a Direction Forecasting task\. The displayed agent answer is a summarized version of the full generated response, retaining the predicted direction, trajectory label, and core rationale\.

### 5\.3 Prospective Use: Dynamic Forecasting Beyond Retrospective Evaluation

ForeSci is designed not only for retrospective evaluation but also for fully prospective forecasting\. As a proof of concept, we apply the same cutoff\-controlled taxonomy and evidence\-construction pipeline to the LLM\-agent domain with a 2026\-05\-15 literature cutoff, producing 12 prediction\-only questions balanced across the four task families for the 2026\-05\-16 to 2026\-08\-15 forecast window\. Because the target outcomes had not occurred at writing time, this package is not scored; instead, it demonstrates that the framework can be refreshed with recent literature to generate transparent forecast artifacts\. In the main text, we show one representative agent\-generated forecast case to illustrate how a system turns cutoff\-visible evidence into a concrete forward\-looking research judgment \(Figure [4](https://arxiv.org/html/2606.00644#S5.F4)\)\. This prospective mode enables dynamic evaluation of newly released LLM agents and can also support evidence\-grounded AI research planning before future results are known\. Additional package details and generated examples are provided in Appendix [G](https://arxiv.org/html/2606.00644#A7)\.

## 6 Conclusion

ForeSci evaluates whether LLM agents can turn historically available evidence into forward\-looking AI research judgements\. Its 500 cutoff\-controlled tasks pair offline knowledge bases with hidden post\-cutoff validation targets across four decision families\. Results show that agentic workflows often improve traceability and some evidence\-grounded metrics, but no method is uniformly best across backbones, task families, and evaluation signals\. The diagnostics further reveal evidence\-decision decoupling: agents can cite relevant evidence yet choose the wrong research object, causal role, intervention mode, or time horizon\. By separating factual support, future\-target alignment, traceability, and reviewer\-style persuasiveness, ForeSci makes these failures measurable\. Its prospective mode also shows how refreshed literature can produce transparent forecast artifacts, supporting evaluation of research agents as decision\-making systems rather than literature interfaces alone\.

## Limitations

ForeSci studies forward\-looking research judgement in four fast\-moving AI areas and four decision families\. Its results should therefore be interpreted as evidence about this controlled benchmark setting, not as a universal ranking of research agents across all scientific domains, languages, or time horizons\. The benchmark emphasizes paper\-visible signals; it cannot fully capture tacit community knowledge, unpublished work, private reviewer expectations, or downstream adoption\.

The evaluation also depends on hidden post\-cutoff targets and LLM\-as\-judge metrics\. We use family\-conditioned rubrics, repeated judging for rubric\-style metrics, cross\-backbone comparisons, and diagnostic audits to reduce over\-interpretation, but the scores remain approximations of rubric\-based reviewer persuasiveness rather than direct measurements of scientific value\. In particular, venue positioning and strategic planning are inherently preference\-sensitive decisions, so the benchmark is best used to compare failure modes and evidence use rather than to certify a single best method\.

## Ethical Considerations

The benchmark is built from public scholarly artifacts and is intended for diagnostic evaluation of research\-assistant systems\. It should not be used to automate real venue recommendations, peer\-review decisions, or research prioritization without human oversight\. Because the tasks ask systems to make forward\-looking research decisions, a poorly calibrated system could encourage premature convergence on fashionable directions or overstate the evidential basis for a forecast\. We therefore report traceability, uncertainty\-sensitive reviewer scores, and limitations alongside outcome\-oriented metrics\.

## Code and Data Availability

Code, public benchmark artifacts, prompts, and evaluation scripts will be released at [https://github\.com/roytian1992/ResearchForesight](https://github.com/roytian1992/ResearchForesight)\. Hidden validation targets will be withheld to preserve benchmark integrity\.

## References

- ACL 2025 call for main conference papers\.Note:[https://2025\.aclweb\.org/calls/main\_conference\_papers/](https://2025.aclweb.org/calls/main_conference_papers/)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- A\. Ajith, A\. Singh, J\. DeYoung, N\. Kunievsky, A\. C\. Kozlowski, O\. Tafjord, J\. Evans, D\. S\. Weld, T\. Hope, and D\. Downey \(2026\)PreScience: a benchmark for forecasting scientific contributions\.arXiv preprint arXiv:2602\.20459\.External Links:[Link](https://arxiv.org/abs/2602.20459),[Document](https://dx.doi.org/10.48550/arXiv.2602.20459)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p2.1),[§1](https://arxiv.org/html/2606.00644#S1.p3.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Bragg, M\. D’Arcy, N\. Balepur, D\. Bareket, B\. Dalvi Mishra, S\. Feldman, D\. Haddad, J\. D\. Hwang, P\. Jansen, V\. Kishore, B\. P\. Majumder, A\. Naik, S\. Rahamimov, K\. Richardson, A\. Singh, H\. Surana, A\. Tiktinsky, R\. Vasu, G\. Wiener, C\. Anastasiades, S\. Candra, J\. Dunkelberger, D\. Emery, R\. Evans, M\. Hamada, R\. Huff, R\. Kinney, M\. Latzke, J\. Lochner, R\. Lozano\-Aguilera, N\. Nguyen, S\. Rao, A\. Tanaka, B\. Vlahos, P\. Clark, D\. Downey, Y\. Goldberg, A\. Sabharwal, and D\. S\. Weld \(2025\)AstaBench: rigorous benchmarking of AI agents with a scientific research suite\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- Center for AI Safety, Scale AI, and HLE Contributors Consortium \(2026\)A benchmark of expert\-level academic questions to assess AI capabilities\.Nature649,pp\. 1139–1146\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Chen, M\. Xiong, Y\. Lu, W\. Han, A\. Deng, Y\. He, J\. Wu, Y\. Li, Y\. Liu, and B\. Hooi \(2025\)Mlr\-bench: evaluating ai agents on open\-ended machine learning research\.Advances in Neural Information Processing Systems38\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§1](https://arxiv.org/html/2606.00644#S1.p2.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- CVPR \(2025\)CVPR 2025 call for papers\.Note:[https://cvpr\.thecvf\.com/Conferences/2025/CallForPapers](https://cvpr.thecvf.com/Conferences/2025/CallForPapers)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- ECCV \(2024\)ECCV 2024 call for papers\.Note:[https://eccv2024\.ecva\.net/Conferences/2024/CallForPapers](https://eccv2024.ecva.net/Conferences/2024/CallForPapers)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- EMNLP \(2025\)EMNLP 2025 call for main conference papers\.Note:[https://2025\.emnlp\.org/calls/main\_conference\_papers/](https://2025.emnlp.org/calls/main_conference_papers/)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- O\. Francois \(2015\)Arbitrariness of peer review: a bayesian analysis of the NIPS experiment\.arXiv preprint arXiv:1507\.06411\.External Links:[Link](https://arxiv.org/abs/1507.06411),[Document](https://dx.doi.org/10.48550/arXiv.1507.06411)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p4.1)\.
- A\. E\. Ghareeb, B\. Chang, L\. Mitchener, A\. Yiu, C\. J\. Szostkiewicz, D\. Shved, G\. J\. Gyimesi, J\. M\. Laurent, S\. M\. Wright, M\. T\. Razzak, A\. D\. White, S\. C\. Finnemann, M\. M\. Hinks, and S\. G\. Rodriques \(2026\)A multi\-agent system for automating scientific discovery\.Nature,pp\. 1–3\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1)\.
- Google \(2025\)Gemini models\.Note:[https://ai\.google\.dev/gemini\-api/docs/models](https://ai.google.dev/gemini-api/docs/models)Accessed 2026\-05\-26Cited by:[§4\.2](https://arxiv.org/html/2606.00644#S4.SS2.p1.1)\.
- M\. Gridach, J\. Nanavati, K\. Zine El Abidine, L\. Mendes, and C\. Mack \(2025\)Agentic AI for scientific discovery: a survey of progress, challenges, and future directions\.arXiv preprint arXiv:2503\.08979\.External Links:[Link](https://arxiv.org/abs/2503.08979),[Document](https://dx.doi.org/10.48550/arXiv.2503.08979)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1)\.
- ICCV \(2025\)ICCV 2025 call for papers\.Note:[https://iccv\.thecvf\.com/Conferences/2025/CallForPapers](https://iccv.thecvf.com/Conferences/2025/CallForPapers)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- ICLR \(2025\)ICLR 2025 call for papers\.Note:[https://iclr\.cc/Conferences/2025/CallForPapers](https://iclr.cc/Conferences/2025/CallForPapers)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- ICML \(2025\)ICML 2025 call for papers\.Note:[https://icml\.cc/Conferences/2025/CallForPapers](https://icml.cc/Conferences/2025/CallForPapers)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- P\. Jansen, S\. Hassan, and R\. Wang \(2025\)Matter\-of\-fact: a benchmark for verifying the feasibility of literature\-supported claims in materials science\.InEmpirical Methods in Natural Language Processing,Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Jiang \(2026\)HindSight: evaluating LLM\-generated research ideas via future impact\.arXiv preprint arXiv:2603\.15164\.External Links:[Link](https://arxiv.org/abs/2603.15164),[Document](https://dx.doi.org/10.48550/arXiv.2603.15164)Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- E\. Karger, H\. Bastani, Y\. Chen, Z\. Jacobs, D\. Halawi, F\. Zhang, and P\. Tetlock \(2025\)ForecastBench: a dynamic benchmark of AI forecasting capabilities\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- P\. Kargupta, N\. Zhang, Y\. Zhang, R\. Zhang, P\. Mitra, and J\. Han \(2025\)Taxoadapt: aligning llm\-based multidimensional taxonomy construction to evolving research corpora\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 29834–29850\.Cited by:[§B\.2](https://arxiv.org/html/2606.00644#A2.SS2.p2.1),[§3\.3](https://arxiv.org/html/2606.00644#S3.SS3.p1.2)\.
- KDD \(2025\)KDD 2025 research track call for papers\.Note:[https://kdd2025\.kdd\.org/research\-track\-call\-for\-papers/](https://kdd2025.kdd.org/research-track-call-for-papers/)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- J\. Lála, O\. O’Donoghue, A\. Shtedritski, S\. Cox, S\. G\. Rodriques, and A\. D\. White \(2023\)PaperQA: retrieval\-augmented generative agent for scientific research\.arXiv preprint arXiv:2312\.07559\.External Links:[Link](https://arxiv.org/abs/2312.07559),[Document](https://dx.doi.org/10.48550/arXiv.2312.07559)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p2.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel, S\. Riedel, and D\. Kiela \(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp\.InAdvances in Neural Information Processing Systems \(NeurIPS\),External Links:[Link](https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p2.1)\.
- L\. Li, W\. Xu, J\. Guo, R\. Zhao, X\. Li, Y\. Yuan, B\. Zhang, Y\. Jiang, Y\. Xin, R\. Dang, D\. Zhao, Y\. Rong, T\. Feng, and L\. Bing \(2024\)Chain of ideas: revolutionizing research via novel idea development with llm agents\.arXiv preprint arXiv:2410\.13185\.External Links:[Link](https://arxiv.org/abs/2410.13185),[Document](https://dx.doi.org/10.48550/arXiv.2410.13185)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Liu, X\. Wei, L\. Shi, X\. Li, B\. Zhang, P\. S\. Dhillon, and Q\. Mei \(2026\)Exante: a benchmark for ex\-ante inference in large language models\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1551–1571\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p3.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Liu, Z\. Yang, T\. Xie, J\. Ni, B\. Gao, Y\. Li, S\. Tang, W\. Ouyang, E\. Cambria, and D\. Zhou \(2025\)ResearchBench: benchmarking LLMs in scientific discovery via inspiration\-based task decomposition\.arXiv preprint arXiv:2503\.21248\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Lu, C\. Lu, R\. T\. Lange, Y\. Yamada, S\. Hu, J\. Foerster, D\. Ha, and J\. Clune \(2026\)Towards end\-to\-end automation of ai research\.Nature651\(8107\),pp\. 914–919\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Lu, S\. Mishra, T\. Xia, L\. Qiu, K\. Chang, S\. Zhu, O\. Tafjord, P\. Clark, and A\. Kalyan \(2022\)Learn to explain: multimodal reasoning via thought chains for science question answering\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Lupidi, B\. Gauri, T\. S\. Foster, B\. Al Omari, D\. Magka, A\. Pepe, A\. Audran\-Reiss, M\. Aghamelu, N\. Baldwin, L\. Cipolina\-Kun, J\. Gagnon\-Audet, C\. H\. Leow, S\. Lefdal, H\. Mossalam, A\. Moudgil, S\. Nazir, E\. Tewolde, I\. Urrego, J\. Armengol Estape, A\. Budhiraja, G\. Chaurasia, A\. Charnalia, D\. Dunfield, K\. Hambardzumyan, D\. Izcovich, M\. Josifoski, I\. Mediratta, K\. Niu, P\. Pathak, M\. Shvartsman, E\. Toledo, A\. Protopopov, R\. Raileanu, A\. Miller, T\. Shavrina, J\. Foerster, and Y\. Bachrach \(2026\)AIRS\-bench: a suite of tasks for frontier AI research science agents\.arXiv preprint arXiv:2602\.06855\.External Links:[Link](https://arxiv.org/abs/2602.06855),[Document](https://dx.doi.org/10.48550/arXiv.2602.06855)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§1](https://arxiv.org/html/2606.00644#S1.p2.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Min, K\. Krishna, X\. Lyu, M\. Lewis, W\. Yih, P\. Koh, M\. Iyyer, L\. Zettlemoyer, and H\. Hajishirzi \(2023\)Factscore: fine\-grained atomic evaluation of factual precision in long form text generation\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 12076–12100\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p4.1),[§4\.1](https://arxiv.org/html/2606.00644#S4.SS1.SSS0.Px1.p1.3)\.
- NeurIPS \(2025\)NeurIPS 2025 call for papers\.Note:[https://neurips\.cc/Conferences/2025/CallForPapers](https://neurips.cc/Conferences/2025/CallForPapers)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- OpenAI \(2025\)GPT\-5\.2\.Note:[https://openai\.com/index/introducing\-gpt\-5\-2/](https://openai.com/index/introducing-gpt-5-2/)Accessed 2026\-05\-26Cited by:[§4\.2](https://arxiv.org/html/2606.00644#S4.SS2.p1.1)\.
- Qwen Team \(2025\)Qwen3: think deeper, act faster\.Note:[https://qwenlm\.github\.io/blog/qwen3/](https://qwenlm.github.io/blog/qwen3/)Accessed 2026\-05\-26Cited by:[§4\.2](https://arxiv.org/html/2606.00644#S4.SS2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.Advances in neural information processing systems36,pp\. 68539–68551\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p2.1)\.
- T\. Schopf and M\. Färber \(2026\)Is this idea novel? an automated benchmark for judgment of research ideas\.arXiv preprint arXiv:2603\.10303\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Si, D\. Yang, and T\. Hashimoto \(2025\)Can llms generate novel research ideas? a large\-scale human study with 100\+ nlp researchers\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 94003–94092\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- SIGIR \(2025\)SIGIR 2025 call for full papers\.Note:[https://sigir2025\.dei\.unipd\.it/call\-full\-papers\.html](https://sigir2025.dei.unipd.it/call-full-papers.html)Accessed 2026\-05\-20Cited by:[§B\.7](https://arxiv.org/html/2606.00644#A2.SS7.SSS0.Px1.p1.1)\.
- J\. Tang, L\. Xia, Z\. Li, and C\. Huang \(2025\)Ai\-researcher: autonomous scientific innovation\.Advances in Neural Information Processing Systems38,pp\. 9481–9520\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Tao, P\. Wu, Z\. Jin, X\. Bai, H\. Zhao, C\. Dou, X\. Chen, J\. Li, L\. Li, C\. Tao, and W\. Zhang \(2025\)PROPHET: an inferable future forecasting benchmark with causal intervened likelihood estimation\.arXiv preprint arXiv:2504\.01509\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Tong, M\. Li, H\. Li, Y\. Yang, Y\. Mou, W\. Ma, Z\. Xi, H\. Chen, X\. Liu, Q\. Cheng, M\. Zhang, Q\. Chen, W\. Ge, Q\. Guo, T\. Ying, T\. Sun, Y\. Zheng, X\. Chen, J\. Zhao, N\. Ding, X\. Huang, Y\. Jiang, and X\. Qiu \(2026\)AI can learn scientific taste\.arXiv preprint arXiv:2603\.14473\.External Links:[Link](https://arxiv.org/abs/2603.14473)Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wan, Y\. Liu, A\. Ajith, C\. Grazian, B\. Hoex, W\. Zhang, C\. Kit, T\. Xie, and I\. Foster \(2024\)SciQAG: a framework for auto\-generated science question answering dataset with fine\-grained evaluation\.arXiv preprint arXiv:2405\.09939\.External Links:[Link](https://arxiv.org/abs/2405.09939),[Document](https://dx.doi.org/10.48550/arXiv.2405.09939)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p2.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Wang, M\. Cheng, S\. Yu, Z\. Liu, Z\. Guo, X\. Li, and Q\. Liu \(2025\)PaperArena: an evaluation benchmark for tool\-augmented agentic reasoning on scientific literature\.arXiv preprint arXiv:2510\.10909\.External Links:[Link](https://arxiv.org/abs/2510.10909),[Document](https://dx.doi.org/10.48550/arXiv.2510.10909)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§1](https://arxiv.org/html/2606.00644#S1.p2.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Wang, P\. Jiang, J\. Sun, Z\. Shi, H\. Yu, J\. Han, and H\. Ji \(2026\)Learning to predict future\-aligned research proposals with language models\.arXiv preprint arXiv:2603\.27146\.External Links:[Link](https://arxiv.org/abs/2603.27146),[Document](https://dx.doi.org/10.48550/arXiv.2603.27146)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p3.1),[§1](https://arxiv.org/html/2606.00644#S1.p4.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Wu, D\. Zhang, X\. Li, J\. Xu, Y\. Duan, Y\. Liu, J\. Pan, Q\. Zhu, X\. Zhou, J\. Wei, S\. Li, J\. Chen, C\. He, and C\. Tan \(2026\)Intern\-atlas: a methodological evolution graph as research infrastructure for AI scientists\.arXiv preprint arXiv:2604\.28158\.External Links:[Link](https://arxiv.org/abs/2604.28158),[Document](https://dx.doi.org/10.48550/arXiv.2604.28158)Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2606.00644#S3.SS3.SSS0.Px2.p3.1)\.
- Y\. Yamada, R\. T\. Lange, C\. Lu, S\. Hu, C\. Lu, J\. Foerster, J\. Clune, and D\. Ha \(2025\)The AI scientist\-v2: workshop\-level automated scientific discovery via agentic tree search\.arXiv preprint arXiv:2504\.08066\.External Links:[Link](https://arxiv.org/abs/2504.08066),[Document](https://dx.doi.org/10.48550/arXiv.2504.08066)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p1.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=WE_vluYUL-X)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p2.1)\.
- C\. Ye, Z\. Hu, Y\. Deng, Z\. Huang, M\. D\. Ma, Y\. Zhu, and W\. Wang \(2024\)MIRAI: evaluating LLM agents for event forecasting\.arXiv preprint arXiv:2407\.01231\.External Links:[Link](https://arxiv.org/abs/2407.01231),[Document](https://dx.doi.org/10.48550/arXiv.2407.01231)Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p3.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Yuan, Z\. Ding, and A\. Vlachos \(2026\)Introducing FOReCAst: the future outcome reasoning and confidence assessment benchmark\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=7hVyqs8NaP)Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- Z\.AI \(2025\)GLM\-4\.6 release notes\.Note:[https://docs\.z\.ai/guides/llm/glm\-4\.6](https://docs.z.ai/guides/llm/glm-4.6)Accessed 2026\-05\-26Cited by:[§4\.2](https://arxiv.org/html/2606.00644#S4.SS2.p1.1)\.
- Z\. Zeng, J\. Liu, S\. Chen, T\. He, Y\. Liao, Y\. Tian, J\. Wang, Z\. Wang, Y\. Yang, L\. Yin, M\. Yin, Z\. Zhu, T\. Cai, Z\. Chen, J\. Chen, Y\. Du, X\. Gao, J\. Guo, L\. Hu, J\. Jiao, X\. Li, J\. Liu, S\. Ni, Z\. Wen, G\. Zhang, K\. Zhang, X\. Zhou, J\. Blanchet, X\. Qiu, M\. Wang, and W\. Huang \(2025\)Futurex: an advanced live benchmark for llm agents in future prediction\.arXiv preprint arXiv:2508\.11987\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Zhao, Z\. Brumbaugh, Y\. Wang, H\. Hajishirzi, and N\. A\. Smith \(2024\)Set the clock: temporal alignment of pretrained language models\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 15015–15040\.Cited by:[§1](https://arxiv.org/html/2606.00644#S1.p3.1),[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Zhu, Y\. Zhang, P\. Nie, and Y\. Zhang \(2026\)SciImpact: a multi\-dimensional, multi\-field benchmark for scientific impact prediction\.arXiv preprint arXiv:2604\.17141\.Cited by:[§2](https://arxiv.org/html/2606.00644#S2.SS0.SSS0.Px2.p1.1)\.

## Appendix A Responsible Research and Artifact Details

##### Artifacts, licenses, and intended use\.

ForeSci releases benchmark tasks, evaluation goldsets, prompts, schemas, scripts, and a cutoff\-aligned scholarly knowledge base for research evaluation\. The released benchmark artifacts are intended for diagnostic comparison of research\-assistant systems under explicit temporal controls, not for automated peer\-review, venue selection, hiring, funding, or research\-prioritization decisions\. Source scholarly papers and bibliographic records retain their original access conditions; derived benchmark packages should therefore be used consistently with the repository and release terms and with the access conditions of the underlying public scholarly artifacts\.

##### Privacy and content review\.

The benchmark is constructed from public scholarly artifacts rather than private user data\. We do not collect demographic attributes, private communications, or personally sensitive records\. Released task files contain only public task text and minimal metadata needed for answer generation, while evaluation goldsets are separated from model\-visible tasks\. Human\-validation results are reported only in aggregate, and no individual annotator records are released\.

##### Compute and model access\.

All experiments are inference\-only; no model training or fine\-tuning is performed\. We evaluate named answer\-generation backbones and evaluator models through hosted or locally served API\-compatible endpoints\. Exact parameter counts are not publicly available for some hosted/proprietary models, so we report model names and access modes where exact sizes cannot be verified\. The offline knowledge base and retrieval indexes are built once and then reused across methods\. We do not tune method hyperparameters on hidden future targets; generation and evaluation use fixed prompt templates, retrieval settings, and metric rubrics\.

##### Human validation protocol\.

Human validation is limited to expert annotation of model outputs and extracted claims\. The validation pool consists of eight AI researchers: five PhD students and three faculty advisors with expertise in artificial intelligence\. Annotators were recruited for expert validation rather than through a crowdsourcing marketplace\. They follow the rubrics described in Appendix [C\.3](https://arxiv.org/html/2606.00644#A3.SS3): for Reviewer Persuasiveness, they score whether an answer would be convincing to a knowledgeable reviewer for the specified task family; for claim extraction, they check whether extracted units are faithful to the source answer, atomic, decision\-relevant, and sufficiently complete\. Annotators are informed that labels are used only for aggregate validation of the benchmark metrics\. No private personal data are collected, and no individual\-level annotations are released\. Because the protocol consists of expert assessment of model outputs and benchmark claims, with no intervention on human subjects or collection of sensitive personal data, it is treated as minimal\-risk expert annotation\.

##### Use of AI assistants\.

AI assistants were used during code prototyping, experiment orchestration, result checking, LaTeX editing, and drafting support under author supervision\. The authors made the final decisions about benchmark design, data curation, experimental protocol, reported results, and paper claims\.

## Appendix B Benchmark Construction Details

This appendix expands the construction details behind Figure [2](https://arxiv.org/html/2606.00644#S3.F2)\. The main text defines the benchmark corpus, taxonomy\-based evidence layer, and task families; here we focus on implementation choices that affect cutoff alignment, auditability, and validation\.

### B\.1 Corpus Filtering and Temporal Freezing

For each domain, we start with broad domain queries, harvest papers through March 2026, normalize metadata, and deduplicate query hits\. The first pass is recall\-oriented: it retains candidates with non\-trivial domain evidence even when the terminology is not yet stable\.

The domain\-relevance screen considers whether the paper’s main problem, method, system, evaluation, dataset, or application setting is substantively tied to the target area\. The benchmark\-core screen then marks a paper as *core* when it has a central domain contribution, a concrete research asset type, and useful future\-facing signal for tasks, methods, evaluation, bottlenecks, or design patterns\. We keep weaker relevant papers as *support*; borderline cases may be retained for audit; noisy or out\-of\-domain papers are excluded\.

Construction enforces temporal separation\. Public questions and accessible support are frozen at historical cutoffs, while later papers are withheld for validation only\. ForeSci uses three\-month and six\-month forecast settings, and venue\-conditioned tasks follow venue\-cycle timing because conference evidence appears on venue\-specific schedules\. This design gives every system the same pre\-cutoff literature environment and prevents direct retrieval of hidden future papers\.
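The temporal freeze can be pictured as a simple date partition. The sketch below is illustrative only: the `Paper` record, the `split_by_cutoff` helper, and the choice to treat the cutoff date itself as visible are assumptions, not the released ForeSci pipeline.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical minimal record; real corpus entries carry far more metadata.
    paper_id: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Partition papers into (visible, hidden) relative to a historical cutoff.

    Assumption: papers dated on the cutoff day count as pre-cutoff evidence.
    """
    visible = [p for p in papers if p.published <= cutoff]
    hidden = [p for p in papers if p.published > cutoff]
    return visible, hidden


corpus = [
    Paper("p1", date(2025, 3, 10)),
    Paper("p2", date(2025, 9, 1)),
    Paper("p3", date(2025, 12, 20)),
]
visible, hidden = split_by_cutoff(corpus, date(2025, 9, 1))
visible_ids = [p.paper_id for p in visible]
hidden_ids = [p.paper_id for p in hidden]
```

Under this reading, every system retrieves only from `visible`, while `hidden` is reserved for validation.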

### B\.2 Cutoff Slicing and Forecast\-Oriented Taxonomy Adaptation

Taxonomy induction uses the cutoff\-aligned core/support corpora described in Section [3\.2](https://arxiv.org/html/2606.00644#S3.SS2)\. To preserve temporal structure, papers are processed in chronological slices\. Earlier periods use coarser slices, while periods close to the cutoff use finer slices when short\-horizon movement matters\. This design keeps recent changes visible instead of smoothing them into the older literature\.
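One way to picture coarse-then-fine slicing is a labeling function that buckets older papers by year and recent papers by month. This is a sketch under stated assumptions: the yearly/monthly granularity and the six-month `fine_window_months` default are hypothetical choices, since the slice widths actually used are not specified here.

```python
from datetime import date


def month_index(d):
    """Months elapsed since year 0, used to measure distance to the cutoff."""
    return d.year * 12 + d.month - 1


def slice_label(published, cutoff, fine_window_months=6):
    """Return a slice label: monthly near the cutoff, yearly further back.

    Assumption: 'fine' means monthly and 'coarse' means yearly.
    """
    if published > cutoff:
        raise ValueError("post-cutoff papers must not enter taxonomy induction")
    months_before = month_index(cutoff) - month_index(published)
    if months_before < fine_window_months:
        return f"{published.year}-{published.month:02d}"
    return str(published.year)


cutoff = date(2025, 9, 30)
old_label = slice_label(date(2023, 5, 2), cutoff)
recent_label = slice_label(date(2025, 7, 15), cutoff)
```

With these settings a 2023 paper falls into a year-wide slice, while a paper from two months before the cutoff gets its own monthly slice, which keeps cutoff-local movement from being averaged into older literature.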

Our taxonomy builder follows TaxoAdapt’s multidimensional routing and adaptive expansion principle \(Kargupta et al\., [2025](https://arxiv.org/html/2606.00644#bib.bib10)\)\. Papers are routed across contribution dimensions such as tasks, methods, datasets, evaluation methods, and application domains\. Dense or poorly covered regions trigger width or depth expansion, allowing new research subdirections to appear as the cutoff\-visible literature evolves\.

We adapt this process for ForeSci in three ways: induction uses filtered core/support papers, temporal slicing emphasizes cutoff\-local deltas, and induced nodes must be grounded in node evidence records before they can support benchmark construction\.

### B\.3 Candidate Direction Selection

Candidate directions are selected from taxonomy nodes and small groups of related nodes\. We retain a candidate when it satisfies four criteria: it has sufficient pre\-cutoff support, it expresses a clear research decision, it is specific enough for evaluation, and it can be separated from hidden future validation evidence\. Candidates are revised or removed when they are ambiguous, duplicated, too broad, too narrow, or weakly grounded in the underlying papers\.

### B\.4 Method\-Development and Bottleneck Signals

*Method\-development signals* are derived from method and evaluation nodes, paper co\-assignments, title/abstract method surfaces, full\-text evidence, and bottleneck–mechanism cues\. They record relations such as extension, adaptation, replacement, component reuse, and method competition\. These signals provide trajectory evidence for comparing directions and reasoning about mechanism\-level change\.

*Bottleneck signals* summarize recurring limitations, evaluation gaps, reliability or safety concerns, dataset or benchmark needs, and technical risks\. We verify them against full\-text evidence from pre\-cutoff papers, with emphasis on limitations that recur across multiple sources or connect to concrete evaluation failures\.

### B\.5 Venue\-Community Profiles

Venue\-conditioned tasks use *venue\-community signals from metadata*\. We construct profiles for venue families such as ACL/EMNLP/NAACL, ICLR/ICML/NeurIPS, AAAI/IJCAI, SIGIR/KDD, and CVPR/ICCV/ECCV\. Each profile summarizes contribution styles, maturity expectations, reviewer risks, evidence\-package expectations, and nearby compatible venue families\. These profiles support venue\-cycle judgments based on research fit, evidence standards, and reviewer expectations\.

### B\.6 Human–LLM Collaborative Audit

LLMs draft construction records from cutoff\-aligned evidence by summarizing supporting papers, extracting claims and limitations, identifying method\-development and bottleneck signals, proposing candidate directions, and flagging possible leakage risks\.

Human experts then inspect the source papers and full\-text evidence\. They verify support strength, temporal validity, specificity, and leakage risk; revise unclear wording; remove weak or duplicated items; and approve the final construction records\. The same audit process is applied before task release\.

### B\.7 Task Curation and Artifact Separation

Task curation checks three properties: a stable historical premise, a clear public decision, and post\-cutoff evidence suitable for validation\. Items are revised when the decision is underspecified, the validation evidence does not match the requested judgment, or multiple items express the same research decision\.

Each released task contains the public question $q$, cutoff $t$, forecast window, task\-family instructions, answer requirements, and pre\-cutoff support packet $\mathcal{E}_{\leq t}(q)$\. Internal taxonomy identifiers, representative\-paper lists, construction traces, and audit notes remain internal\. Hidden future validation targets $\mathcal{G}_{>t}(q)$ are held exclusively for the evaluation protocol\.
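The artifact separation above can be sketched as a record whose release view exposes only the public fields. All class and field names here (`PublicTask`, `TaskRecord`, `release_view`, and the example values) are hypothetical illustrations, not the benchmark's actual schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class PublicTask:
    # Fields released to answer-generation systems.
    question: str                 # q
    cutoff: str                   # t
    forecast_window: str
    family_instructions: str
    answer_requirements: str
    support_packet: list = field(default_factory=list)  # E_{<=t}(q)


@dataclass
class TaskRecord:
    public: PublicTask
    hidden_targets: list = field(default_factory=list)  # G_{>t}(q), evaluation only
    internal: dict = field(default_factory=dict)        # taxonomy ids, audit notes, traces


def release_view(record):
    """Return only the public part of a task; hidden and internal data never leave."""
    return asdict(record.public)


record = TaskRecord(
    public=PublicTask(
        question="Which direction is most likely to gain traction?",
        cutoff="2025-09-01",
        forecast_window="6-month",
        family_instructions="direction forecasting",
        answer_requirements="name one direction and justify it",
        support_packet=["pre-cutoff snippet"],
    ),
    hidden_targets=["post-cutoff validation target"],
    internal={"taxonomy_node": "example-node"},
)
released = release_view(record)
```

Keeping the hidden targets on a separate field makes it harder for an evaluation-only artifact to leak into the generation path by accident.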

| Domain | 3\-month | 6\-month | Venue cycle | Total |
| --- | --- | --- | --- | --- |
| LLM Agents | 25 | 90 | 23 | 138 |
| LLM Fine\-tuning and Post\-training | 15 | 69 | 15 | 99 |
| RAG and Retrieval Structuring | 10 | 69 | 13 | 92 |
| Visual Generative Modeling and Diffusion | 19 | 78 | 74 | 171 |
| Total | 69 | 306 | 125 | 500 |

Table A1: Task counts by domain and horizon type\. Three\-month and six\-month columns come directly from task\-level horizon metadata; venue\-conditioned tasks use venue\-cycle timing because their advisory and submission windows are venue specific\.

![Refer to caption](https://arxiv.org/html/2606.00644v1/x1.png)

Figure A1: Monthly raw\-paper volume in the four ForeSci domains\. Counts are unique papers after normalizing paper identifiers within each domain\-month\. All four domains are shown from January 2023 through the March 2026 benchmark cutoff\.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/taxonomy_evolution.png)

Figure A2: Temporal evolution of the domain corpora and induced taxonomies used in ForeSci\. The curves summarize cumulative domain KB coverage and taxonomy\-node growth over the literature slices used for the release\.

##### Publication\-calendar effects\.

The volume curves should be read as a cutoff\-dependent background variable rather than as a direct measure of foresight difficulty\. LLM\-heavy domains show recurring increases around late winter, May–June, and early autumn, broadly matching major AI submission cycles such as ICML \(ICML, [2025](https://arxiv.org/html/2606.00644#bib.bib26)\), ACL \(ACL, [2025](https://arxiv.org/html/2606.00644#bib.bib29)\), KDD \(KDD, [2025](https://arxiv.org/html/2606.00644#bib.bib31)\), SIGIR \(SIGIR, [2025](https://arxiv.org/html/2606.00644#bib.bib32)\), NeurIPS \(NeurIPS, [2025](https://arxiv.org/html/2606.00644#bib.bib27)\), EMNLP \(EMNLP, [2025](https://arxiv.org/html/2606.00644#bib.bib30)\), and ICLR \(ICLR, [2025](https://arxiv.org/html/2606.00644#bib.bib28)\)\. Visual generation also exhibits spring and year\-end structure consistent with CV venue cycles such as CVPR \(CVPR, [2025](https://arxiv.org/html/2606.00644#bib.bib33)\), ICCV \(ICCV, [2025](https://arxiv.org/html/2606.00644#bib.bib34)\), and ECCV \(ECCV, [2024](https://arxiv.org/html/2606.00644#bib.bib35)\)\. These regularities motivate explicit temporal cutoffs and horizon metadata, so that task construction separates genuine post\-cutoff research change from predictable seasonality induced by publication and review calendars\.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/taxonomy_example.png)

Figure A3: Illustrative examples of temporal taxonomy branching in ForeSci\. The figure shows how broad pre\-cutoff topics split into task\-seeding subdirections; Table [A2](https://arxiv.org/html/2606.00644#A2.T2) summarizes representative branch examples in text form\.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/method_evolution.png)

Figure A4: Method\-evolution signals for foresight task construction\. Rather than tracking topic frequency alone, ForeSci models each domain as an evolutionary chain in which limitations of prior methods expose bottleneck cues, emerging technical responses provide mechanism cues, and the resulting bottleneck–mechanism interaction points toward a future research shift\. The examples show this progression for LLM agents, LLM fine\-tuning and post\-training, RAG and retrieval structuring, and visual generative modeling\. The bottom row summarizes how bottleneck, mechanism, and trade\-off cues feed the four downstream task families: direction forecasting, bottleneck–opportunity discovery, strategic research planning, and venue\-conditioned positioning\.

| Domain | Representative taxonomy subtree | Representative method\-evolution signal |
| --- | --- | --- |
| LLM Agents | Tool use and function calling → tool\-augmented task planning → tool\-augmented graph\-based task planning; tool usage benchmarks → API / interactive / multimodal tool usage benchmarks | Generic tool use and task planning → tool reliability and long\-horizon grounding bottlenecks → structured tool orchestration, memory, feedback, and verification loops → domain\-grounded scientific workflow agents |
| LLM Fine\-tuning and Post\-training | Instruction tuning datasets → data\-efficient instruction tuning datasets / synthetic instruction tuning datasets; instruction tuning → domain\-specific, multimodal, and chain\-of\-thought instruction tuning | Curated instruction and preference data → data cost, coverage, and weak reasoning\-trace bottlenecks → data selection, synthetic feedback, self\-correction, and process supervision → synthetic and reasoning\-aware post\-training pipelines |
| RAG and Retrieval Structuring | Iterative retrieval\-generation pipelines → retrieval strategy evaluation → adaptive / hybrid / multimodal retrieval evaluation; reasoning\-aware evaluation → evidence\-aligned reasoning evaluation | Retrieve\-then\-generate pipelines → hallucination, multi\-hop grounding, stale evidence, and citation\-fidelity bottlenecks → adaptive retrieval, query decomposition, evidence verification, and reasoning\-aware reranking → evidence\-aligned agentic RAG systems |
| Visual Generative Modeling and Diffusion | Temporal consistency in video generation → spatio\-temporal consistency metrics → text\-to\-video / 3D\-aware / physics\-aware consistency metrics; video diffusion transformer architectures → efficient / image\-to\-video / video\-editing diffusion transformers | Image\-level diffusion generation and static fidelity metrics → temporal inconsistency, controllability, physical plausibility, and long\-form coherence bottlenecks → video diffusion transformers, motion\-aware conditioning, 3D/physics constraints, and consistency metrics → controllable long\-form and physics\-aware video generation |

Table A2: Representative taxonomy subtrees and method\-evolution signals used during ForeSci construction\. The taxonomy column illustrates how temporally induced domain structures refine broad research areas into fine\-grained subdirections\. The method\-evolution column summarizes the complementary bottleneck–mechanism–shift patterns used to seed future\-facing task targets\.

The evolution curves in Figure [A2](https://arxiv.org/html/2606.00644#A2.F2) provide two useful checks on benchmark construction\. First, they show that the domains differ substantially in corpus scale and structural breadth, which motivates using a shared construction protocol across diverse technical areas\. Second, the branch examples in Figure [A3](https://arxiv.org/html/2606.00644#A2.F3) and method\-evolution signals in Figure [A4](https://arxiv.org/html/2606.00644#A2.F4) show that broad nodes split into increasingly specialized descendants and recurring bottlenecks become concrete mechanisms over time rather than remaining a static flat inventory of labels\. This is why ForeSci separates temporal taxonomy induction, public support construction, and hidden future supervision\.

## Appendix C Evaluation Protocol

### C\.1 Metric Calibration Details

This appendix gives the formulas and weighting details for the evaluation protocol introduced in Section [4](https://arxiv.org/html/2606.00644#S4)\. The reported metrics are Prediction Factuality, Future\-Target Alignment, Evidence Traceability Score, and Reviewer Persuasiveness\.

##### Prediction Factuality\.

Let $a$ denote a candidate answer and let $\mathcal{C}(a)=\{c_i\}_{i=1}^{m}$ be its extracted atomic claims\. The reported Prediction Factuality score is claim\-level F1 over answer\-claim support and hidden claim\-bank coverage\. For each extracted answer claim $c_i$, a benchmark\-aware verifier assigns *supported*, *partially supported*, *unsupported*, or *not checkable* relative to the public task and hidden claim units\. Let $\phi_i\in\{1, 0.5, 0, 0\}$ be the corresponding answer\-claim support score\. Claim precision is

$$\mathrm{Prec}(a)=\frac{1}{m}\sum_{i=1}^{m}\phi_i. \tag{3}$$

Let $\mathcal{G}^{+}=\{g_j\}_{j=1}^{n}$ denote the expanded hidden claim bank for the task\. For each hidden claim, the judge assigns *covered*, *partially covered*, or *not covered* by the candidate answer\. Let $\psi_j\in\{1, 0.5, 0\}$ be the corresponding coverage score\. Claim recall is

$$\mathrm{Rec}(a)=\frac{1}{n}\sum_{j=1}^{n}\psi_j. \tag{4}$$

The final score is

$$\mathrm{PredictionFactuality}(a)=\frac{2\,\mathrm{Prec}(a)\,\mathrm{Rec}(a)}{\mathrm{Prec}(a)+\mathrm{Rec}(a)}. \tag{5}$$

Precision and recall are retained as intermediate quantities; the reported metric is the F1 score\.
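A minimal sketch of Equations (3) to (5), assuming the verifier's labels arrive as plain strings using the label names from the evaluation prompt in Appendix C.2. The function name and the zero-score fallback for empty inputs are assumptions for illustration.

```python
# Label-to-score maps follow the phi and psi definitions above.
SUPPORT = {"supported": 1.0, "partially_supported": 0.5,
           "unsupported": 0.0, "not_checkable": 0.0}
COVERAGE = {"covered": 1.0, "partially_covered": 0.5, "not_covered": 0.0}


def prediction_factuality(claim_labels, gold_labels):
    """Return (precision, recall, f1) from verifier labels.

    Assumption: an answer with no claims, or a task with no hidden claims,
    scores 0.0; the degenerate case is not specified in the text.
    """
    if not claim_labels or not gold_labels:
        return 0.0, 0.0, 0.0
    prec = sum(SUPPORT[label] for label in claim_labels) / len(claim_labels)
    rec = sum(COVERAGE[label] for label in gold_labels) / len(gold_labels)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f1


prec, rec, f1 = prediction_factuality(
    ["supported", "partially_supported", "unsupported", "not_checkable"],
    ["covered", "partially_covered", "not_covered"],
)
```

In this toy case precision is 1.5/4 = 0.375, recall is 1.5/3 = 0.5, and the F1 is 3/7, roughly 0.43. Note that a *not checkable* claim lowers precision exactly as an *unsupported* one does.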

##### Future\-Target Alignment \(FTA\)\.

FTA is family\-conditioned\. For direction forecasting and bottleneck–opportunity tasks, we use Reference\-Guided FTA\. The hidden future target is represented as a set of slots $\mathcal{G}_b=\{g_j\}_{j=1}^{n}$\. Each slot $g_j$ contains one or more acceptable textual variants $V_j$, such as a primary future target, a root\-bottleneck paraphrase, an unlocked opportunity variant, or an evidence\-backed mechanism variant\. Variants are alternatives inside the same target slot; they are not counted as additional targets\. Negative confusions are retained for qualitative auditing but are not counted as positive slots\.

Let $\mathcal{C}(a)=\{c_i\}_{i=1}^{m}$ be the benchmark\-relevant prediction claims extracted from the answer\. We embed every claim and target variant with bge\-m3 and compute $s(c,t)=\max(0,\cos(e(c),e(t)))$, placing pairwise similarity on a 0–1 scale\. The slot score is the best claim–variant match inside the slot:

$$S_j(a,b)=\max_{1\leq i\leq m,\; t\in V_j} s(c_i,t). \tag{6}$$

The Reference\-Guided FTA score is the mean slot score:

$$\mathrm{FTA}_{\mathrm{RG}}(a,b)=\frac{1}{n}\sum_{j=1}^{n}S_j(a,b). \tag{7}$$

This score is intentionally not an F1 and has no code\-side cap or weighted combination term\.
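Equations (6) and (7) can be sketched on precomputed vectors. The embedding step is deliberately left out: the vectors below are toy 2-D stand-ins rather than bge-m3 outputs, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def fta_reference_guided(claim_vecs, slots):
    """Mean over slots of the best clipped claim-variant similarity.

    `slots` is a list of slots; each slot is a list of variant vectors V_j.
    Assumption: an answer with no claims scores 0.0 on every slot.
    """
    if not slots:
        return 0.0
    slot_scores = [
        max((max(0.0, cosine(c, t)) for c in claim_vecs for t in variants), default=0.0)
        for variants in slots
    ]
    return sum(slot_scores) / len(slot_scores)


claims = [[1.0, 0.0], [0.0, 1.0]]
slots = [
    [[1.0, 0.0]],               # slot 1: one variant, matched exactly by a claim
    [[-1.0, 0.0], [1.0, 1.0]],  # slot 2: two variants; only the second is close
]
score = fta_reference_guided(claims, slots)
```

Because variants are alternatives inside one slot, the second slot contributes only its best match (about 0.71), so adding more paraphrases to a slot never increases the number of targets.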

For strategic planning and venue positioning, the hidden target is an ordered list $\pi^{*}=(r_1,\ldots,r_K)$ over candidate directions or venues\. The answer is parsed into an inferred ranking $\hat{\pi}$\. We score three deterministic components: whether the top item matches, whether each preferred item appears in the same position or elsewhere, and whether pairwise order relations are preserved\. Let

$$S_{\mathrm{top}}=\mathbb{I}[\hat{\pi}_1=r_1], \tag{8}$$

$$S_{\mathrm{pos}}=\frac{1}{K}\sum_{k=1}^{K}s_k, \tag{9}$$

$$S_{\mathrm{pair}}=\frac{2}{K(K-1)}\sum_{i<j}\mathbb{I}[r_i\prec_{\hat{\pi}}r_j], \tag{10}$$

where $s_k=1$ when $r_k$ appears in position $k$, $s_k=0.5$ when it appears in another inferred position, and $s_k=0$ when it is missing\. The recall\-style ranking score is the mean of these three components\. A symmetric precision\-style score is computed over the inferred ranking using the same top, position, and pairwise\-order checks relative to $\pi^{*}$\. The reported ranking\-aware FTA is the F1 of these precision and recall terms:

$$\mathrm{FTA}_{\mathrm{rank}}(a,b)=\frac{2PR}{P+R}. \tag{11}$$

The reported FTA is $\mathrm{FTA}_{\mathrm{RG}}$ for Direction and Bottleneck tasks, and $\mathrm{FTA}_{\mathrm{rank}}$ for Planning and Venue tasks\.
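The ranking-aware score can be sketched as follows. Several details are assumptions rather than confirmed implementation choices: the precision-style term is obtained by swapping the roles of the two lists, a pair counts as preserved only when both items appear in the other ranking, and a single-item list (where Equation (10) is undefined) reuses the top-match score for the pair term.

```python
def ranking_score(reference, observed):
    """Mean of the top, position, and pairwise-order checks (Equations 8-10)."""
    k = len(reference)
    if k == 0:
        return 0.0
    pos = {item: idx for idx, item in enumerate(observed)}
    s_top = 1.0 if observed and observed[0] == reference[0] else 0.0
    # 1.0 for same position, 0.5 for present elsewhere, 0.0 for missing.
    s_pos = sum(
        1.0 if pos.get(item) == idx else 0.5 if item in pos else 0.0
        for idx, item in enumerate(reference)
    ) / k
    pairs = [(reference[i], reference[j]) for i in range(k) for j in range(i + 1, k)]
    if pairs:
        kept = sum(1 for hi, lo in pairs if hi in pos and lo in pos and pos[hi] < pos[lo])
        s_pair = kept / len(pairs)
    else:
        s_pair = s_top  # assumption: no pairs exist when K = 1
    return (s_top + s_pos + s_pair) / 3


def fta_rank(target, predicted):
    """F1 of the recall-style and (assumed) role-swapped precision-style scores."""
    recall = ranking_score(target, predicted)
    precision = ranking_score(predicted, target)
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


swapped = fta_rank(["A", "B", "C"], ["A", "C", "B"])
perfect = fta_rank(["A", "B", "C"], ["A", "B", "C"])
```

For the swapped example, the top item matches, two items sit one position off (0.5 each), and one of three pairs is inverted, giving 7/9 for both precision and recall and therefore 7/9 overall.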

##### Evidence Traceability Score\.

The traceability evaluator scores external evidence linkage $e(a)$, support specificity $s(a)$, and answer\-internal trace $t(a)$ from the support snapshot attached to a method output\. The final score is

$$\mathrm{Trace}(a)=0.50\,e(a)+0.25\,s(a)+0.25\,t(a). \tag{12}$$

If a method output has no attached external support artifact, Evidence Traceability is not applicable and is reported as – in paper\-facing tables\. We compute Traceability aggregates, low\-score rates, and cross\-model trace\-profile correlations only over evidence\-grounded methods\.
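Equation (12) and the not-applicable rule reduce to a few lines. The sketch assumes the three sub-scores lie on a 0 to 1 scale and uses `None` to stand for the "–" entries; both choices, and the function name, are illustrative assumptions.

```python
def evidence_traceability(linkage, specificity, internal_trace, has_support_artifact=True):
    """Weighted traceability score, or None when no external support artifact exists."""
    if not has_support_artifact:
        return None  # reported as a dash; excluded from aggregates
    return 0.50 * linkage + 0.25 * specificity + 0.25 * internal_trace


grounded = evidence_traceability(0.8, 0.6, 0.4)
native = evidence_traceability(0.8, 0.6, 0.4, has_support_artifact=False)

# Aggregates skip non-applicable rows instead of treating them as zeros.
scores = [s for s in (grounded, native) if s is not None]
mean_trace = sum(scores) / len(scores)
```

External linkage carries half of the weight, so an answer with a coherent internal narrative but no identifiable evidence items cannot score above 0.5.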

##### Reviewer Persuasiveness\.

Reviewer Persuasiveness is a deliberative LLM\-as\-judge metric over how persuasive the research decision would be to a rubric\-based virtual reviewer, rather than over factual overlap alone\. The evaluator first builds a family\-conditioned research brief from the public task and hidden decision targets\. For direction forecasting, the task\-specific dimensions are forecast specificity, signal interpretation, trajectory discipline, and uncertainty calibration\. For venue positioning, they are venue\-fit reasoning, reviewer\-expectation awareness, package\-upgrade specificity, and contrastive venue discrimination\. For bottleneck–opportunity tasks, they are causal bottleneck analysis, opportunity plausibility, technical non\-obviousness, and adoption\-pathway reasoning\. For strategic planning, they are milestone specificity, dependency\-chain quality, experiment executability, and risk or kill criteria\. The evaluator also reports generic decision clarity, mechanistic reasoning, comparative reasoning, and uncertainty or risk awareness, then assigns a holistic persuasiveness score\. Because this metric is intended to approximate a peer\-review\-style research call under uncertainty, repeated evaluator runs are used to make variance visible\.

### C\.2 Evaluation Prompt Templates

This appendix summarizes the main prompt templates used by the benchmark evaluation stack\. The templates below are lightly edited for readability, but they preserve the operative instructions, rubrics, and output schemas used in the implementation\.

Atomic Claim Extraction Prompt

You are extracting benchmark\-relevant atomic factual claims from a candidate answer\.

Inputs:

- Public task definition: `{PUBLIC_TASK_JSON}`
- Candidate answer: `{CANDIDATE_ANSWER}`

Your task:

- Decompose the answer into a small set of atomic factual claims\.
- Keep only claims that could be grounded in a research\-paper benchmark\.
- Prefer claims about research directions, mechanisms, bottlenecks, venue fit, baselines, experiments, risks, trajectory, and evidence signals\.
- Ignore purely stylistic statements, generic advice, and duplicates\.
- Return at most `{MAX_CLAIMS}` claims\.

Output requirements:

- Output JSON only\.
- Do not include explanations, markdown, or extra text\.

Output format: `{"claims": ["...", "..."]}`

Prompt template used for atomic claim extraction in Prediction Factuality\.

Claim\-Level Factuality and Coverage Prompt

You are a claim\-level evaluator for a research\-foresight benchmark\. Compute claim precision and gold\-claim recall for the candidate answer\. This is a factual/content coverage metric, not a research\-taste metric\.

Inputs:

- Public task definition: `{PUBLIC_TASK_JSON}`
- Candidate answer claims: `{ANSWER_CLAIMS_JSON}`
- Gold claim units: `{GOLD_CLAIM_UNITS_JSON}`

Judgment rules:

- For answer claims, label each as `supported`, `partially_supported`, `unsupported`, or `not_checkable` relative to the public task and gold claim units\.
- For gold claim units, label coverage as `covered`, `partially_covered`, or `not_covered` by the candidate answer\.
- Do not reward generic topical overlap when the mechanism, venue rationale, or decision is different\.
- Keep this metric independent of future\-target alignment scoring\.

Output format: `{"answer_claim_verdicts": [{"claim": "...", "label": "supported | partially_supported | unsupported | not_checkable", "matched_gold_claim_ids": ["..."], "rationale": "..."}], "gold_claim_coverage": [{"gold_claim_id": "...", "label": "covered | partially_covered | not_covered", "matched_answer_claims": ["..."], "rationale": "..."}]}`

Prompt template used for claim\-level factuality and coverage in Prediction Factuality\.

Evidence Traceability Prompt

You are an Evidence Traceability Auditor for a research benchmark\.

Objective:

- Judge whether the answer’s important conclusions can be traced to explicit support artifacts provided with the method output\.
- Do not judge factual truth against outside knowledge\.
- Reward explicit linkage from claims to papers, snippets, evidence bundles, or reasoning traces\.
- Penalize answers that appear strong but cannot be connected to identifiable support\.

Inputs:

- Public task definition: `{PUBLIC_TASK_JSON}`
- Candidate answer: `{CANDIDATE_ANSWER}`
- Extracted evidence\-content brief: `{SUPPORT_BRIEF_JSON}`

Rubric:

- `evidence_linkage`: are the answer’s main claims visibly connected to explicit evidence items, retrieved papers, snippets, or trace steps?
- `support_specificity`: is the support concrete and discriminative enough that a reviewer could audit why these conclusions were reached?
- `answer_internal_trace`: does the answer explain a coherent path from evidence to conclusion, rather than only listing anchors?

Output format: `{"dimension_scores": {"evidence_linkage": 0.0, "support_specificity": 0.0, "answer_internal_trace": 0.0}, "traceability_score": 0.0, "strengths": ["..."], "weaknesses": ["..."], "rationale": "..."}`

Prompt template used for Evidence Traceability auditing\.

Reviewer Persuasiveness Prompt

You are a Reviewer Persuasiveness evaluator for ForeSci\.

Objective:

- Assess the quality of the candidate answer as a research decision under uncertainty\.
- Use the reference brief as a structured decision neighborhood, not as a literal answer key\.
- Reward clear tradeoffs, causal or venue\-fit mechanism, concrete decision criteria, appropriate uncertainty, and explicit handling of alternatives or risks\.
- Penalize noncommittal lists, generic prose, unsupported drift from the task, and missing task\-specific decision requirements\.

Inputs:

- Public task definition: `{PUBLIC_TASK_JSON}`
- Reviewer\-persuasiveness brief: `{RESEARCH_JUDGEMENT_BRIEF_JSON}`
- Candidate answer: `{CANDIDATE_ANSWER}`
- Task\-specific dimensions for this family: `{TASK_SPECIFIC_DIMENSIONS_JSON}`

Output format: `{"research_judgment_score": 0.0, "generic_dimension_scores": {"decision_clarity": 0.0, "mechanistic_reasoning": 0.0, "comparative_reasoning": 0.0, "uncertainty_or_risk_awareness": 0.0}, "task_specific_dimension_scores": {"dimension_name": 0.0}, "strengths": ["..."], "weaknesses": ["..."], "rationale": "..."}`

Prompt template used for family\-conditioned Reviewer Persuasiveness\.

### C\.3 Human Validation

We conduct human\-validation studies over the formal evaluation artifacts\. The studies check whether the rubric\-based Reviewer Persuasiveness score tracks expert preferences and whether automatic claim extraction produces decision\-critical units\.

| Validation target | Human sample | Main agreement result | Main diagnostic conclusion |
| --- | --- | --- | --- |
| Reviewer Persuasiveness | 400 answers | Mean Spearman 0\.57; mean top\-1 agreement 50\.0% | The rubric evaluator is positively aligned with expert scoring, but planning exposes sensitivity to polished scaffolded prose\. |
| Claim extraction | 240 answers | Overall $F_1$ 0\.77; atomicity pass 85\.5% | The extractor is precise and usually atomic, but it misses implicit causal bridges and over\-splits verbose planning or venue answers\. |

Table A3: Summary of human\-validation analyses for the ForeSci evaluation stack\.

#### C\.3\.1 Reviewer Persuasiveness Human Validation

We first evaluate whether the automatic Reviewer Persuasiveness score reflects expert judgments on open\-ended research decisions\. The validation set contains 400 long\-form model answers, balanced across the four task families\. For each family, we compare human expert scores with the automatic DeepSeek\-V4 Reviewer Persuasiveness score using Spearman correlation \(Table [A4](https://arxiv.org/html/2606.00644#A3.T4)\) and ranking comparison \(Table [A5](https://arxiv.org/html/2606.00644#A3.T5)\)\. Overall, the automatic Reviewer Persuasiveness metric shows moderate positive agreement with human expert judgments across task families \(per\-family correlations of 0\.54 to 0\.66, and Kendall’s $\tau$ of at least 0\.80 for method rankings\), supporting its use as a reasonable proxy for human assessment\.

| Family | $n$ | Pearson |
| --- | --- | --- |
| Bottleneck | 100 | 0\.62 |
| Direction | 100 | 0\.66 |
| Strategic Planning | 100 | 0\.54 |
| Venue | 100 | 0\.58 |

Table A4: Human expert scores versus automatic Reviewer Persuasiveness scores, reported by task family\.

| Family | Human ranking | Automatic ranking | Kendall’s $\tau$ |
| --- | --- | --- | --- |
| Bottleneck | ARIS ≻ CoI ≻ RA ≻ RAG ≻ Native | ARIS ≻ RA ≻ CoI ≻ RAG ≻ Native | 0\.80 |
| Direction | RAG ≻ ARIS ≻ CoI ≻ RA ≻ Native | ARIS ≻ RAG ≻ CoI ≻ RA ≻ Native | 0\.80 |
| Strategic Planning | RA ≻ ARIS ≻ CoI ≻ RAG ≻ Native | RA ≻ ARIS ≻ CoI ≻ RAG ≻ Native | 1\.00 |
| Venue | ARIS ≻ RA ≻ RAG ≻ CoI ≻ Native | ARIS ≻ RA ≻ CoI ≻ RAG ≻ Native | 0\.80 |

Table A5: Method\-ranking comparison between human experts and automatic Reviewer Persuasiveness\. RA denotes ResearchAgent\-style and RAG denotes Hybrid RAG\.
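The Kendall's τ values in Table A5 can be reproduced directly from the printed rankings. The sketch below uses the standard tie-free definition (concordant minus discordant pairs, divided by the number of pairs); the helper name is an illustrative choice.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two tie-free rankings of the same items."""
    pos_b = {item: idx for idx, item in enumerate(ranking_b)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) with x ranked above y in ranking_a.
    for x, y in combinations(ranking_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Bottleneck row of Table A5.
human = ["ARIS", "CoI", "RA", "RAG", "Native"]
automatic = ["ARIS", "RA", "CoI", "RAG", "Native"]
tau_bottleneck = kendall_tau(human, automatic)

# Strategic Planning row: identical rankings.
planning = ["RA", "ARIS", "CoI", "RAG", "Native"]
tau_planning = kendall_tau(planning, planning)
```

With five methods there are ten pairs, so a single adjacent swap gives (9 − 1)/10 = 0.80, which matches the Bottleneck, Direction, and Venue rows.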
#### C\.3\.2 Claim Extraction Human Validation

We next validate the claim extractor used by Prediction Factuality\. The audit covers 240 answers, with 60 answers per family \(Table [A6](https://arxiv.org/html/2606.00644#A3.T6)\)\. Human annotators assess whether extracted units are precise, decision\-relevant, atomic, and sufficiently complete relative to the source answer\. Overall, the extractor is precise \(0\.78 to 0\.86\) with somewhat lower recall \(0\.70 to 0\.76\), and atomicity and decision relevance stay at or above 80% in every family; noise is highest for Strategic Planning and Venue answers \(16\.5% and 18\.2%\)\. These results support the reliability of Prediction Factuality as a claim\-level measure of evidence\-grounded research judgment\.

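| Family | $n$ | Prec\. | Rec\. | $F_1$ | Atomic | Relevant | Noise |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bottleneck | 60 | 0\.84 | 0\.72 | 0\.78 | 88\.5% | 85\.0% | 11\.2% |
| Direction | 60 | 0\.86 | 0\.76 | 0\.81 | 91\.0% | 89\.5% | 8\.0% |
| Strategic Planning | 60 | 0\.79 | 0\.70 | 0\.74 | 82\.3% | 81\.0% | 16\.5% |
| Venue | 60 | 0\.78 | 0\.73 | 0\.75 | 80\.1% | 83\.0% | 18\.2% |

Table A6: Human validation of automatic decision\-critical claim extraction\.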

## Appendix D System Adaptation Details

The agent\-based systems compared in the main paper were not evaluated in their original open\-web form\. All three were adapted into *cutoff\-faithful offline agents* so that they operate inside the benchmark’s pre\-cutoff knowledge boundary and expose artifacts compatible with the ForeSci metric stack\.

##### Shared adaptation principles\.

Across all agent baselines, we applied the same three benchmark\-side constraints\.

- Open\-web removal\. All online literature search or browsing steps were replaced by offline access to the benchmark knowledge base, including paper metadata, section\-level snippets, and benchmark\-side support packets\.
- Cutoff\-faithful evidence flow\. Retrieval, intermediate reasoning, and final answer rendering were constrained to use only pre\-cutoff assets\. Hidden future evidence remained strictly evaluation\-only\.
- Benchmark\-native output exposure\. Each adapted agent was modified to emit the support artifacts needed by the benchmark judges, such as retrieval traces, evidence items, and family\-native structured answer fields when available\.

##### Agent\-specific adaptations\.

For CoI\-style, following the open\-web removal and cutoff\-faithful evidence\-flow principles, we replaced its online literature\-search module with an offline benchmark\-KB retrieval adaptor while retaining the multi\-chain idea\-construction module\. For ResearchAgent\-style, following the benchmark\-native output and task\-family alignment principles, we replaced web\-facing evidence gathering with benchmark support\-packet access and modified the internal planning/review prompts into task\-family\-aware modules for bottleneck, forecasting, strategic\-planning, and venue\-positioning tasks\. For ARIS\-style, following all three shared principles, we replaced open retrieval with benchmark\-KB hybrid retrieval and family\-specific evidence construction, and modified the final renderer into a contract\-aware output module so that answers respect the candidate set, task family, and venue\-facing framing required by ForeSci\.

## Appendix E Supplementary Metric Results

### E\.1 Evaluation results for each task family

Table [A7](https://arxiv.org/html/2606.00644#A5.T7) reports the evaluation results separately for each task family\. The results show that method rankings vary across task families, suggesting that different foresight tasks benefit from different agent designs and that no single agent consistently dominates across all task types\.

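| Family | Method | Qwen3\-235B Fact\. | Qwen3\-235B FTA | Qwen3\-235B Trace | Qwen3\-235B Pers\. | GPT\-5\.2 Fact\. | GPT\-5\.2 FTA | GPT\-5\.2 Trace | GPT\-5\.2 Pers\. | GLM\-4\.6 Fact\. | GLM\-4\.6 FTA | GLM\-4\.6 Trace | GLM\-4\.6 Pers\. | Gemini\-3 Fact\. | Gemini\-3 FTA | Gemini\-3 Trace | Gemini\-3 Pers\. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bottleneck–Opportunity | Native LLM | 0\.840 | 0\.608 | – | 0\.764 | 0\.799 | 0\.614 | – | 0\.847 | 0\.651 | 0\.595 | – | 0\.631 | 0\.672 | 0\.576 | – | 0\.742 |
| | Hybrid RAG | 0\.824 | 0\.609 | 0\.439 | 0\.758 | 0\.804 | 0\.613 | 0\.453 | 0\.834 | 0\.623 | 0\.596 | 0\.424 | 0\.590 | 0\.704 | 0\.589 | 0\.342 | 0\.742 |
| | CoI | 0\.851 | 0\.607 | 0\.560 | 0\.748 | 0\.819 | 0\.610 | 0\.702 | 0\.862 | 0\.683 | 0\.596 | 0\.485 | 0\.612 | 0\.700 | 0\.583 | 0\.401 | 0\.727 |
| | ResearchAgent | 0\.802 | 0\.608 | 0\.571 | 0\.750 | 0\.829 | 0\.610 | 0\.670 | 0\.859 | 0\.641 | 0\.597 | 0\.476 | 0\.590 | 0\.716 | 0\.591 | 0\.415 | 0\.716 |
| | ARIS | 0\.860 | 0\.607 | 0\.554 | 0\.777 | 0\.813 | 0\.607 | 0\.693 | 0\.856 | 0\.695 | 0\.596 | 0\.486 | 0\.605 | 0\.735 | 0\.576 | 0\.476 | 0\.735 |
| Direction Forecasting | Native LLM | 0\.598 | 0\.635 | – | 0\.786 | 0\.670 | 0\.635 | – | 0\.839 | 0\.523 | 0\.604 | – | 0\.674 | 0\.618 | 0\.601 | – | 0\.733 |
| | Hybrid RAG | 0\.567 | 0\.637 | 0\.584 | 0\.776 | 0\.645 | 0\.634 | 0\.576 | 0\.828 | 0\.515 | 0\.604 | 0\.533 | 0\.660 | 0\.589 | 0\.606 | 0\.590 | 0\.732 |
| | CoI | 0\.577 | 0\.635 | 0\.745 | 0\.811 | 0\.658 | 0\.637 | 0\.748 | 0\.866 | 0\.528 | 0\.603 | 0\.612 | 0\.683 | 0\.601 | 0\.602 | 0\.599 | 0\.737 |
| | ResearchAgent | 0\.568 | 0\.633 | 0\.722 | 0\.803 | 0\.662 | 0\.633 | 0\.741 | 0\.858 | 0\.512 | 0\.596 | 0\.577 | 0\.674 | 0\.595 | 0\.604 | 0\.556 | 0\.733 |
| | ARIS | 0\.542 | 0\.627 | 0\.743 | 0\.818 | 0\.589 | 0\.628 | 0\.745 | 0\.866 | 0\.506 | 0\.601 | 0\.616 | 0\.657 | 0\.573 | 0\.600 | 0\.649 | 0\.715 |
| Strategic Planning | Native LLM | 0\.465 | 0\.580 | – | 0\.756 | 0\.441 | 0\.548 | – | 0\.818 | 0\.359 | 0\.472 | – | 0\.664 | 0\.421 | 0\.554 | – | 0\.744 |
| | Hybrid RAG | 0\.446 | 0\.554 | 0\.332 | 0\.737 | 0\.446 | 0\.553 | 0\.364 | 0\.809 | 0\.420 | 0\.532 | 0\.372 | 0\.670 | 0\.422 | 0\.521 | 0\.354 | 0\.723 |
| | CoI | 0\.504 | 0\.644 | 0\.429 | 0\.759 | 0\.476 | 0\.582 | 0\.556 | 0\.822 | 0\.470 | 0\.638 | 0\.445 | 0\.689 | 0\.414 | 0\.528 | 0\.414 | 0\.725 |
| | ResearchAgent | 0\.532 | 0\.687 | 0\.460 | 0\.778 | 0\.481 | 0\.577 | 0\.560 | 0\.827 | 0\.471 | 0\.637 | 0\.471 | 0\.682 | 0\.424 | 0\.544 | 0\.405 | 0\.730 |
| | ARIS | 0\.460 | 0\.598 | 0\.542 | 0\.760 | 0\.488 | 0\.587 | 0\.683 | 0\.834 | 0\.450 | 0\.577 | 0\.512 | 0\.677 | 0\.422 | 0\.555 | 0\.554 | 0\.747 |
| Venue Positioning | Native LLM | 0\.508 | 0\.664 | – | 0\.838 | 0\.562 | 0\.715 | – | 0\.882 | 0\.503 | 0\.665 | – | 0\.729 | 0\.508 | 0\.678 | – | 0\.744 |
| | Hybrid RAG | 0\.552 | 0\.722 | 0\.375 | 0\.831 | 0\.545 | 0\.702 | 0\.241 | 0\.878 | 0\.522 | 0\.694 | 0\.387 | 0\.712 | 0\.505 | 0\.651 | 0\.368 | 0\.684 |
| | CoI | 0\.515 | 0\.684 | 0\.506 | 0\.810 | 0\.550 | 0\.698 | 0\.364 | 0\.879 | 0\.490 | 0\.621 | 0\.455 | 0\.665 | 0\.519 | 0\.688 | 0\.476 | 0\.746 |
| | ResearchAgent | 0\.534 | 0\.711 | 0\.498 | 0\.817 | 0\.567 | 0\.712 | 0\.366 | 0\.883 | 0\.534 | 0\.676 | 0\.472 | 0\.676 | 0\.504 | 0\.672 | 0\.460 | 0\.736 |
| | ARIS | 0\.567 | 0\.745 | 0\.593 | 0\.817 | 0\.579 | 0\.745 | 0\.387 | 0\.889 | 0\.499 | 0\.684 | 0\.466 | 0\.658 | 0\.487 | 0\.655 | 0\.591 | 0\.737 |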

Table A7: Main family-level results across four answer-generation backbones. Each family block reports five methods; bold marks the best method within the same family, backbone, and metric.

### E.2 Scalar Metric Stability

We also test the run-to-run stability of the scalar metric stack on the same Qwen3-235B 100-task subset used for the preference study. The subset contains 500 evaluated rows after expanding 100 tasks by five methods. We keep the candidate answers fixed and repeat the metric evaluation five times: one existing formal-evaluation run plus four independent DeepSeek-V4 replicate runs. For each metric, Table [A8](https://arxiv.org/html/2606.00644#A5.T8) reports the largest standard deviation and the largest range of method-level run means across all family–method cells, together with a row-level diagnostic: the mean per-row standard deviation. For Evidence Traceability, Native LLM rows are excluded because the metric is not applicable without an external support artifact.

| Metric | Max SD | Max range | Row SD |
|---|---|---|---|
| Prediction Factuality | 0.021 | 0.053 | 0.045 |
| Future-Target Alignment | 0.006 | 0.014 | 0.004 |
| Evidence Traceability | 0.050 | 0.111 | 0.108 |
| Reviewer Persuasiveness | 0.021 | 0.053 | 0.034 |

Table A8: Five-run stability of scalar metrics on the Qwen3-235B 100-task subset. “Max SD” and “Max range” summarize method-level run means across family–method cells. “Row SD” summarizes the mean per-row variation across the same five runs.

Future-Target Alignment is the most stable metric: Planning and Venue FTA are deterministic ranking-aware scores, while Bottleneck and Direction use reference-guided embedding similarity over extracted prediction claims. Prediction Factuality and Reviewer Persuasiveness show moderate variance; they support family-level comparisons, but close within-family method rankings should be interpreted as small-margin differences. Evidence Traceability has the largest variance because it asks a rubric-style evaluator to assess evidence linkage and support specificity from method artifacts; even so, its variance remains moderate and within an acceptable range.
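The three stability summaries described in this subsection can be sketched as follows. This is a minimal illustration, not the released evaluation code: the `scores` layout and the function name `stability_summary` are hypothetical, and the use of population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stability_summary(scores):
    """Summarize run-to-run stability for one metric.

    `scores` maps (family, method, row_id) -> list of scores, one per run.
    This layout is a hypothetical one chosen for the sketch.
    """
    # Group rows into family-method cells.
    cells = defaultdict(list)
    for (family, method, _row_id), runs in scores.items():
        cells[(family, method)].append(runs)

    max_sd = 0.0
    max_range = 0.0
    for rows in cells.values():
        # One mean per run for this family-method cell.
        run_means = [mean(run_scores) for run_scores in zip(*rows)]
        max_sd = max(max_sd, pstdev(run_means))  # population SD is an assumption
        max_range = max(max_range, max(run_means) - min(run_means))

    # Row-level diagnostic: SD of each row across runs, averaged over rows.
    row_sd = mean(pstdev(runs) for runs in scores.values())
    return {"max_sd": max_sd, "max_range": max_range, "row_sd": row_sd}


# Toy input with two runs; the real study uses five runs over 500 rows.
example = {
    ("Planning", "CoI", "r1"): [0.5, 0.7],
    ("Planning", "CoI", "r2"): [0.5, 0.5],
    ("Venue", "CoI", "r3"): [1.0, 1.0],
}
summary = stability_summary(example)
```

Taking the maximum over cells, rather than the average, makes the reported figure a worst-case bound on how far any single family–method mean moves between runs.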

| Generator | Family | Mean | Median | P90 | Max |
|---|---|---|---|---|---|
| GLM-4.6 | Bottleneck–Opportunity | 124.0 | 120 | 159.0 | 228 |
| GLM-4.6 | Direction Forecasting | 97.3 | 96 | 120.0 | 209 |
| GLM-4.6 | Strategic Planning | 154.4 | 154 | 189.0 | 257 |
| GLM-4.6 | Venue Positioning | 400.4 | 397 | 500.6 | 656 |
| GPT-5.2 | Bottleneck–Opportunity | 336.3 | 324 | 427.2 | 682 |
| GPT-5.2 | Direction Forecasting | 232.6 | 228 | 301.0 | 394 |
| GPT-5.2 | Strategic Planning | 257.2 | 255 | 291.6 | 360 |
| GPT-5.2 | Venue Positioning | 1276.0 | 1268 | 1448.2 | 1704 |
| Gemini-3 | Bottleneck–Opportunity | 196.3 | 193 | 256.0 | 320 |
| Gemini-3 | Direction Forecasting | 154.5 | 155 | 180.0 | 214 |
| Gemini-3 | Strategic Planning | 203.8 | 204 | 226.6 | 542 |
| Gemini-3 | Venue Positioning | 526.8 | 517 | 631.8 | 744 |
| Qwen3-235B | Bottleneck–Opportunity | 196.3 | 190 | 259.6 | 381 |
| Qwen3-235B | Direction Forecasting | 171.8 | 172 | 216.0 | 284 |
| Qwen3-235B | Strategic Planning | 172.2 | 173 | 195.0 | 226 |
| Qwen3-235B | Venue Positioning | 598.1 | 563 | 769.0 | 1045 |

Table A9: Answer-length profiles on aligned outputs. Lengths are word-like token counts. Each row contains 625 answers from one generator backbone and one task family.

### E.3 Backbone Style Profile

Different backbone models produce different answer styles. Table [A9](https://arxiv.org/html/2606.00644#A5.T9) reports word-like answer lengths for aligned formal-release outputs. All four generator backbones are aligned on the same 2,500 $(\mathrm{task\_id}, \mathrm{method})$ keys, and each family row contains 625 answers.

We further test whether the generator backbone changes the linguistic rendering of the same benchmark tasks. This audit uses a fully crossed matched sample of 80 tasks, with 20 tasks per family, all four backbones, and all five methods, yielding 1,600 blind answer-level annotations and 8,000 style labels. The judge sees only the task question and candidate answer; it does not see the reference answer, hidden target, method name, backbone name, or scalar metrics. We retain four paper-facing dimensions: decision directness, mechanistic concreteness, structural scaffold intensity, and verbosity/compression.

Figure [A5](https://arxiv.org/html/2606.00644#A5.F5) shows that style variation is mainly backbone-driven. Fixing the task and method while varying the backbone gives larger label disagreement than fixing the task and backbone while varying the method for structural scaffold intensity (0.364 vs. 0.250), verbosity/compression (0.292 vs. 0.168), decision directness (0.086 vs. 0.031), and mechanistic concreteness (0.106 vs. 0.082). The objective surface features support the same interpretation: GPT-5.2 answers are much longer and more visibly scaffolded, averaging 529 words, 8.6 bullet lines, and 4.3 heading lines per answer; GLM-4.6 is the most compressed, averaging 191 words with almost no expansive answers; Gemini-3 averages 273 words with 3.2 bullet lines, and Qwen3-235B averages 284 words with 2.2 bullet lines.
We therefore interpret backbone sensitivity partly as a rendering effect: different generators package similar research decisions with different amounts of structure, directness, and mechanistic detail\. These results also motivate the use of Fact, FTA, and Trace as complementary metrics that directly assess agreement with future facts and traceable pre\-cutoff evidence, since a purely LLM\-as\-a\-judge metric such as Reviewer Persuasiveness may be partially confounded by backbone\-specific rendering style\.
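The matched comparison between backbone variation and method variation can be illustrated with a short sketch. The appendix does not define "label disagreement" precisely, so the definition used here (the fraction of answer pairs within a matched group whose labels differ, averaged over groups) is an assumption, as are the row layout and the name `matched_disagreement`.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def matched_disagreement(rows, vary):
    """Average pairwise label disagreement within matched groups.

    rows: list of dicts with keys "task", "backbone", "method", "label"
          (hypothetical layout for one style dimension).
    vary: "backbone" or "method", the factor allowed to change in a group.
    """
    fixed = "method" if vary == "backbone" else "backbone"
    groups = defaultdict(list)
    for row in rows:
        # Hold the task and the other factor fixed.
        groups[(row["task"], row[fixed])].append(row["label"])

    rates = []
    for labels in groups.values():
        pairs = list(combinations(labels, 2))
        if pairs:
            rates.append(sum(a != b for a, b in pairs) / len(pairs))
    return mean(rates)


# Toy data: the label follows the backbone and ignores the method.
rows = [
    {"task": "t1", "backbone": "B1", "method": "M1", "label": "high"},
    {"task": "t1", "backbone": "B2", "method": "M1", "label": "low"},
    {"task": "t1", "backbone": "B1", "method": "M2", "label": "high"},
    {"task": "t1", "backbone": "B2", "method": "M2", "label": "low"},
]
by_backbone = matched_disagreement(rows, vary="backbone")
by_method = matched_disagreement(rows, vary="method")
```

In this toy case the disagreement is entirely backbone-driven, which is the extreme form of the pattern reported for structural scaffold intensity and verbosity.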

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/backbone_style_profile_summary.png)

Figure A5: Supplementary backbone language-style profile over 1,600 blind answer annotations. The matched sample contains 80 tasks, all four answer-generation backbones, and all five methods. Panel (a) compares matched backbone variation against matched method variation. Panel (b) reports salient label rates by backbone: direct recommendation, concrete mechanism, high structural scaffold, and expansive verbosity. Panel (c) reports objective surface features; bullet and heading counts are scaled for visualization, with original mean counts printed above bars.

## Appendix F Diagnostic Analyses

### F.1 Detailed Error Analysis

We conduct an internal error analysis over all 10,000 formal\-release evaluations, using the same answer, metric, and hidden\-goldset files as the main results\. The analysis is diagnostic rather than a new evaluation: it does not call an LLM and does not rewrite metrics\. It reads evaluator rationales, gold\-claim coverage summaries, future\-unit judgments, and support\-profile metadata from the synchronized evaluation files\. For each metric, we mark the rank\-selected bottom 20% of rows as low scoring; this avoids mixing different score scales while keeping the diagnostic sample size fixed\. Traceability is computed only over evidence\-grounded methods because Native LLM outputs do not expose external support artifacts\. These labels are used only to identify cases for inspection\.
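The rank-selected low-score labeling can be sketched as below. This is an illustration under stated assumptions rather than the released analysis script: the row layout, the function names `mark_low` and `low_rate_by`, and the tie-breaking rule (ties broken by row id) are all hypothetical. Rows where the metric is not applicable, such as Traceability for Native LLM, are represented as `None` and skipped.

```python
from collections import defaultdict


def mark_low(rows, metric, percent=20):
    """Return ids of the rank-selected bottom `percent` of rows for `metric`."""
    scored = [r for r in rows if r[metric] is not None]  # skip not-applicable rows
    k = len(scored) * percent // 100
    # Tie-breaking by id is an assumption made to keep the selection deterministic.
    ranked = sorted(scored, key=lambda r: (r[metric], r["id"]))
    return {r["id"] for r in ranked[:k]}


def low_rate_by(rows, metric, key, percent=20):
    """Share of low-scoring rows within each group defined by `key`."""
    low_ids = mark_low(rows, metric, percent)
    totals, lows = defaultdict(int), defaultdict(int)
    for r in rows:
        if r[metric] is None:
            continue
        totals[r[key]] += 1
        lows[r[key]] += r["id"] in low_ids
    return {group: lows[group] / totals[group] for group in totals}


# Toy data: ten rows with scores 0.0 to 0.9, split across two families.
rows = [
    {"id": i, "family": "Planning" if i < 5 else "Direction", "fact": i / 10}
    for i in range(10)
]
low_ids = mark_low(rows, "fact")
rates = low_rate_by(rows, "fact", "family")
```

Because the cut is made by rank rather than by a score threshold, the overall low-score share stays at the chosen percentage for every metric, and only its distribution across families or methods varies.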

| Grouping | Unit | Low Fact | Low FTA | Low Pers. | Low Trace |
|---|---|---|---|---|---|
| Family | Bottleneck–Opportunity | 0.070 | 0.013 | 0.281 | 0.276 |
| Family | Direction Forecasting | 0.184 | 0.004 | 0.189 | 0.043 |
| Family | Strategic Planning | 0.315 | 0.512 | 0.203 | 0.287 |
| Family | Venue Positioning | 0.232 | 0.271 | 0.127 | 0.195 |
| Method | Native LLM | 0.216 | 0.218 | 0.172 | – |
| Method | Hybrid RAG | 0.207 | 0.211 | 0.208 | 0.323 |
| Method | CoI-style | 0.199 | 0.194 | 0.203 | 0.189 |
| Method | ResearchAgent-style | 0.183 | 0.178 | 0.203 | 0.173 |
| Method | ARIS-style | 0.195 | 0.199 | 0.214 | 0.115 |

Table A10: Low-score rates used for the diagnostic error analysis. Low-score rows are the rank-selected bottom 20% for each metric. Evidence Traceability is computed over evidence-grounded methods only, so Native LLM traceability is not applicable.

Table [A10](https://arxiv.org/html/2606.00644#A6.T10) shows where the failures concentrate. Strategic Planning is the hardest family: its low Prediction Factuality rate is 0.315 and its low Future-Target Alignment rate is 0.512, largely because a wrong top priority can invalidate an otherwise plausible plan. Venue Positioning also has substantial low-Fact (0.232) and low-FTA (0.271) rates, while its evidence-grounded low-trace rate is lower after the venue-specific Trace normalization (0.195). Bottleneck and Direction tasks have near-zero low-FTA rates (0.013 and 0.004) under the repaired reference-guided semantic FTA, so their remaining errors are mainly claim-level or decision-object mismatches rather than global future-target misses. Native LLM has no external support artifact, so Traceability is reported as not applicable rather than as a zero score; low-trace diagnostics below refer to evidence-grounded methods only.

The Fact–FTA conflict audit further separates semantic proximity from exact decision correctness\. The refreshed audit finds 131 large disagreements: 112 rows have an absolute Fact–FTA gap of at least 0\.5, and 19 rows are high\-FTA/low\-Fact cases\. These high\-FTA/low\-Fact cases are mostly in Direction Forecasting, where the answer is near the future target space but misses the exact mechanism, trajectory label, or scope\. We therefore treat large disagreements as analysis signals rather than metric failures\.

Planning diagnostics show that the dominant failure is top\-priority error: within planning rows, the top\-priority\-wrong rate ranges from 0\.468 for ResearchAgent\-style to 0\.586 for Native LLM\. These answers often justify individual candidates well but do not support the full global ordering\. Venue diagnostics show a different failure: among evidence\-grounded methods, most low\-trace venue cases are topical\-not\-venue\-specific evidence chains\. Hybrid RAG has the highest such low\-trace rate \(0\.312\), while agent methods reduce but do not remove the problem \(0\.082–0\.192\)\. This distinction is why the main text treats Planning as a rank\-order sensitivity problem and Venue as an evidence\-specificity problem\.

| Method | Fingerprint | Diagnostic interpretation |
|---|---|---|
| Native LLM | Fluent but unanchored adjacent answers. | Trace is not applicable because no external support artifact is exposed; factual and persuasive errors often come from plausible priors rather than the target decision. |
| Hybrid RAG | Retrieved context without enough decision grounding. | Retrieved evidence is often topical, but the final answer may not connect it to the required bottleneck, ranking, or venue-conditioned comparison. |
| CoI-style | Strong local evidence, wrong global decision. | The reasoning chain can be highly auditable while promoting a locally supported technical issue, such as TOCTOU, into the root decision object. |
| ResearchAgent-style | Coherent scaffold, sometimes wrong root premise. | This method has the lowest overall low-Fact and low-FTA rates, and is strongest on Planning, but a wrong early abstraction can be amplified into a complete causal story. |
| ARIS-style | Specific mechanism over-commitment. | Concrete mechanism selection helps Bottleneck tasks, but can narrow Direction forecasts toward a specific instance when the reference target is a broader future mechanism cluster. |

Table A11: Method-internal fingerprints from the diagnostic error audit. The table summarizes why low-scoring cases fail within each method.

### F.2 Bias–Metric Coupling Analysis

The preceding audits identify answer-reference drift, but do not by themselves show which drifts are responsible for formal metric failures. We therefore join the same 1,600 bias-annotated answers with the repaired formal metric files used in the main results. This merge has no missing evaluations. We treat severity 0 as aligned and severity at least 2 as high bias. The analysis is diagnostic: it does not relabel predictions, change metric weights, or call an additional LLM. Table [A12](https://arxiv.org/html/2606.00644#A6.T12) gives the numeric contrasts underlying Figure [3](https://arxiv.org/html/2606.00644#S5.F3). Temporal-horizon annotations are retained in the released analysis files and included in the normalized effect-size heatmaps.

| Question | Diagnostic contrast | Main result | Interpretation |
|---|---|---|---|
| Causal-role bias vs. Fact | Aligned vs. high causal-role bias | Fact drops by 1.13 SD under high causal-role drift | Fact failures often arise when the answer assigns the wrong causal role to a plausible technical object, such as treating a symptom as a root bottleneck or a solution as the forecast target. |
| Claim-scope bias vs. FTA | Aligned vs. high claim-scope bias | FTA drops by 1.22 SD overall; Planning FTA drops by 1.86 SD | Scope mismatch is most harmful when the task evaluates a global ranking or priority structure, not just topical semantic proximity. |
| Intervention-mode shift vs. FTA | Mode-aligned vs. mode-shifted answers | FTA drops by 1.12 SD overall; Planning FTA drops by 1.99 SD | Changing the action type, such as benchmark-first vs. training-first, directly changes planning and venue decisions even when local rationales remain plausible. |
| High traceability but low FTA | High-trace/non-low-FTA rows vs. high-trace/low-FTA rows | 213 high-trace/non-low-FTA rows vs. 30 high-trace/low-FTA rows; causal severity 1.268 → 2.633; intervention-mode severity 1.592 → 2.833 | High traceability does not guarantee target alignment: evidence can support a coherent but misaligned decision object. |

Table A12: Diagnostic coupling between answer-reference drift dimensions and formal metric failures. Metric drops are normalized by the standard deviation of the corresponding metric over all audited rows, so they are expressed in standard-deviation units. All numbers come from the matched bias-annotation sample joined with the repaired formal metrics.

High Traceability But High Drift: A Venue-Positioning Case

| Field | Content |
|---|---|
| Case | Gemini-3 + ARIS, RFVENUE-0062, venue-conditioned positioning |
| Task | Recommend the 2026 venue family for a reinforcement-learning-from-AI-feedback paper in language-model post-training, using only the 2025-12-31 research profile. |
| Agent answer | Ranks NeurIPS first, treating the work as a general methodological contribution in reinforcement learning and model alignment; ACL/EMNLP is listed as a secondary target. |
| Reference target | Ranks ACL/EMNLP first. The reference frames the paper as a language-model post-training and alignment contribution that should be evaluated against RLHF, SFT, DPO-style preference optimization, and human or expert feedback validation. |
| Scores | Trace = 0.920, Fact = 0.200, FTA = 0.355, Pers. = 0.720. |
| Drift diagnosis | The answer is evidence-supported and venue-plausible, but it reflects a framing drift: it prioritizes a broad NeurIPS-style methodological interpretation, whereas the reference target prioritizes the NLP post-training and alignment venue context. |

Figure A6: Concrete high-traceability/high-drift example. The task, agent answer, and reference target are summarized for readability; the scalar scores are from the repaired formal metrics, and the drift diagnosis comes from the matched answer-reference drift audit.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/bias_metric_coupling_family_breakdown_appendix.png)

Figure A7: Family-level decomposition of bias–metric coupling. Each heatmap reports the normalized score drop, $(\mathbb{E}[m \mid s=0] - \mathbb{E}[m \mid s \geq 2]) / \mathrm{SD}(m)$, for one task family. Rows are formal metrics and columns are answer-reference drift dimensions.

The case inventory supports the same interpretation. In RFDIR-0104, a GPT-5.2 ResearchAgent answer has high Traceability (0.820), high Reviewer Persuasiveness (0.820), and moderate FTA (0.638), but zero Prediction Factuality because it turns the target direction into a standardized PEFT evaluation card, whereas the reference expects adaptive sparse zeroth-order or hybrid PEFT. In RFPLAN-0136, a Qwen3-235B CoI answer is traceable (0.700) and judged plausible (0.720), but FTA is only 0.145 because it deprioritizes embodied evaluation due to immature tooling, while the reference treats that immaturity as the reason evaluation infrastructure should be the top priority. These examples show why high-trace, low-FTA cases are better interpreted as evidence-to-decision failures than as simple evidence absence.
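The normalized score drop used in this subsection, the mean metric value for aligned answers (severity 0) minus the mean for high-bias answers (severity at least 2), divided by the metric's standard deviation over all audited rows, can be computed as in the sketch below. The function name and the choice of population standard deviation are assumptions; the source does not state which estimator was used.

```python
from statistics import mean, pstdev


def normalized_drop(metric_values, severities):
    """(E[m | s = 0] - E[m | s >= 2]) / SD(m) for one metric and one drift dimension."""
    aligned = [m for m, s in zip(metric_values, severities) if s == 0]
    high = [m for m, s in zip(metric_values, severities) if s >= 2]
    # SD is taken over all audited rows, including severity-1 rows.
    return (mean(aligned) - mean(high)) / pstdev(metric_values)


# Toy data: two aligned rows and two high-bias rows.
drop = normalized_drop([1.0, 1.0, 0.0, 0.0], [0, 0, 2, 3])
```

Rows with severity 1 affect only the denominator, so the contrast compares clearly aligned answers against clearly drifted ones while keeping the scale tied to the full audited sample.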

### F.3 Content-Metric Correlation Diagnostic

We additionally test whether the three content\-facing formal metrics collapse into a single latent score\. This diagnostic uses the same 10,000 repaired evaluations as the formal error analysis and computes pairwise Pearson and Spearman correlations among Prediction Factuality, Future\-Target Alignment, and Reviewer Persuasiveness\. Evidence Traceability is excluded from this correlation analysis: it measures support exposure and answer interface rather than content alignment, and is not applicable to Native LLM outputs without external support artifacts\.

Overall, the three content metrics are positively related but not redundant\. Spearman correlation is 0\.491 for Fact vs\. FTA, 0\.231 for Fact vs\. Pers\., and 0\.336 for FTA vs\. Pers\. The main structure is family\-conditioned\. Rank\-style families show much stronger Fact\-vs\.\-FTA coupling: Strategic Planning has Spearman 0\.816 and Venue Positioning has 0\.743, compared with 0\.292 for Bottleneck–Opportunity and 0\.368 for Direction Forecasting\. This is expected because in rank\-style tasks, factual claim correctness and target alignment both depend on choosing the right priority or venue\-conditioned decision\. Venue also illustrates non\-redundancy: its Fact\-vs\.\-FTA correlation is high, but FTA vs\. Pers\. is only 0\.274, so matching the venue target does not automatically imply a persuasive research decision\.
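For readers who want to reproduce this kind of diagnostic on their own score files, Spearman correlation is the Pearson correlation of the rank-transformed values. The sketch below uses only the standard library; assigning average ranks to ties is a standard convention but is an assumption about how the reported values were computed, and the function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Computing `spearman` separately within each family, backbone, or method group gives the conditioned correlations discussed above; the pooled value mixes families with very different coupling.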

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/metric_correlation_summary.png)

Figure A8: Supplementary content-metric correlation diagnostic over 10,000 repaired formal rows. Values are Spearman rank correlations among Prediction Factuality, Future-Target Alignment, and Reviewer Persuasiveness. Evidence Traceability is excluded because it measures support exposure and answer interface rather than content alignment. Family structure dominates the correlation pattern: rank-style planning and venue tasks have much stronger Fact-vs.-FTA coupling than bottleneck and direction tasks, while backbone and method differences are more moderate.

Backbone and method effects are present but smaller than family effects. Across backbones, Fact vs. FTA Spearman ranges from 0.368 for Qwen3-235B to 0.577 for GLM-4.6, while FTA vs. Pers. stays between 0.280 and 0.421. Across methods, Fact vs. FTA ranges from 0.459 for CoI-style to 0.534 for Native LLM, and FTA vs. Pers. ranges from 0.323 for ResearchAgent-style to 0.357 for Native LLM. This supports the paper’s use of multiple metrics: they move in related directions, but their coupling depends on task form and failure mode rather than on a universal scalar quality dimension.

## Appendix G Prospective Forecast Package

ForeSci is primarily an evaluation benchmark with hidden future supervision\. We also use its construction machinery in a more prospective setting: forecasting events that had not yet occurred at writing time\. This appendix gives a compact example rather than a scored benchmark result\. The exercise has no post\-hoc ground truth and no judge/evaluation scores; it is reported only to illustrate how taxonomy evolution can seed near\-future research questions and how methods answer them under the same cutoff\.

The exercise focuses on the LLM\-agent domain with a literature cutoff of 2026\-05\-15\. The released prospective package contains 12 public questions, balanced across the four task families\. The forecast window for bottleneck, direction, and planning tasks is 2026\-05\-16 to 2026\-08\-15; venue questions are advisory venue\-positioning decisions rather than claims about eventual acceptance\. The strict corpus used to construct this package contains 12,124 LLM\-agent rows, after adding 1,380 deduplicated core/support\-screened papers from 2026\-04\-01 to 2026\-05\-15 to the previous 2026M03 corpus\. The same run induces 278 taxonomy nodes in 2026M05A and a method\-evolution asset with 22 method nodes, 20 specialization edges, and 440 paper\-method mention rows\.

| Slice | New | Cum. | Nodes | Width | Depth |
|---|---|---|---|---|---|
| 2026M03 | 1,019 | 10,744 | 243 | 1 | 1 |
| 2026M04 | 685 | 11,429 | 261 | 7 | 7 |
| 2026M05A | 695 | 12,124 | 278 | 5 | 4 |

Table A13: Recent LLM-agent taxonomy evolution used to seed the prospective forecast package. Counts are from the strict core/support-screened taxonomy corpus; 2026M05A covers papers through 2026-05-15.

We generated answers with Qwen3-235B and GPT-5.2 for all five method configurations. Both runs completed all 60 requested final answers, covering 12 tasks for each of Native LLM, Hybrid RAG, CoI-style, ResearchAgent-style, and ARIS-style. No scores are reported because this package concerns future outcomes with no available ground truth. The appendix below therefore treats the run as a transparent forecast artifact: it selects two easily interpretable examples, one from Direction Forecasting and one from Venue-Conditioned Positioning, and reproduces their public questions and unscored generated answers. A standalone supplement contains the broader prospective forecast artifact.

##### Selected unscored forecast outputs\.

We show selected generated answers for Direction Forecasting and Venue-Conditioned Positioning. Figures [4](https://arxiv.org/html/2606.00644#S5.F4) and [A9](https://arxiv.org/html/2606.00644#A7.F9) compare Native LLM outputs with selected agent outputs in paired answer cards. For Venue-Conditioned Positioning, we compress the original longer answers while preserving each method’s venue ranking, framing advice, major risks, and evidence-upgrade recommendations. The standalone supplement contains the verbatim prediction-only outputs.

Direction forecasting example. RFDIR-0149 – Direction Forecasting.

Metadata. Cutoff: 2026-05-15. Forecast window: 2026-05-16 to 2026-08-15.

Question. Looking forward from the 2026-05-15 snapshot, which agent-memory direction is most likely to gain momentum over the next three months: better retrieval, auditable memory security, larger context stores, or persona personalization? Choose one direction, assign one of the trajectory labels accelerating, fragmenting, steady, or cooling, and justify the choice.

Expected answer format. State one concrete forecasted research direction; state exactly one trajectory label; give a brief rationale using only evidence and field conditions available by the cutoff.

Support requirements. Answer the specific domain and time window, use only pre-cutoff reasoning, keep the answer to one prioritized direction, and connect the rationale to relevant mechanisms, systems, evaluations, bottlenecks, or adoption conditions.
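The "exactly one trajectory label" requirement in the expected answer format lends itself to a simple mechanical pre-check. The sketch below is illustrative only and is not part of the ForeSci release: the function names are hypothetical, and plain keyword matching is an assumption that will miscount answers that use a label word in ordinary prose (for example, "steady growth").

```python
import re

# The four trajectory labels named in the task question.
TRAJECTORY_LABELS = ("accelerating", "fragmenting", "steady", "cooling")


def trajectory_labels_in(answer):
    """Return the distinct trajectory labels that appear as whole words."""
    text = answer.lower()
    return [lab for lab in TRAJECTORY_LABELS if re.search(rf"\b{lab}\b", text)]


def has_exactly_one_label(answer):
    """True when the answer mentions exactly one of the four labels."""
    return len(trajectory_labels_in(answer)) == 1


ok = has_exactly_one_label("Auditable memory security: accelerating, given recent audits.")
ambiguous = has_exactly_one_label("Retrieval looks steady or cooling.")
missing = has_exactly_one_label("Larger context stores will matter most.")
```

A check like this only screens for format compliance; whether the chosen direction and label are defensible still depends on the rationale and the pre-cutoff evidence.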

Venue-conditioned positioning example. Figure [A9](https://arxiv.org/html/2606.00644#A7.F9) shows compressed outputs for a prediction-only venue task, preserving the venue ranking and the main positioning rationale while omitting low-level generation details.

![Refer to caption](https://arxiv.org/html/2606.00644v1/figures/question_example_appendix.png)

Figure A9: Side-by-side compressed Venue-Conditioned Positioning outputs for the prediction-only LLM-agent case. The figure includes the task question and compressed Native LLM and ResearchAgent outputs.

## Appendix H Discussion: Beyond Information Retrieval

##### Output rendering matters\.

ForeSci exposes a practical tension: internal reasoning and the final benchmark\-facing answer are not the same object\. Some agents retrieve or organize useful evidence but fail to surface it in a form that satisfies the task contract\. We therefore use a common final\-answer renderer for agent methods, making comparisons less about verbosity and more about whether each method’s evidence can be converted into a defensible research judgement\. This separates rendering failures, upstream retrieval failures, and cases where evidence is present but the proposed research move is not compelling\.

##### Generalization versus specialization\.

No method is uniformly best across foresight decisions\. Retrieval often improves target proximity for direction forecasting, but the repaired FTA surface also shows that semantic closeness to a future target is not the same as selecting the right research decision\. Evidence organization helps traceability, and idea\-construction scaffolds can improve rubric\-based Reviewer Persuasiveness\. But bottleneck discovery, forecasting, strategic planning, and venue positioning impose different constraints: agents need task\-aware mechanisms for deciding what kind of research move they are making and what evidence would make that move credible\.