# PolitNuggets: Benchmarking Agentic Discovery of Long-Tail Political Facts
Source: [https://arxiv.org/html/2605.14002](https://arxiv.org/html/2605.14002)
###### Abstract
Large Reasoning Models (LRMs) embedded in agentic frameworks have transformed information retrieval from static, long-context question answering into open-ended exploration. Yet real-world use requires models to discover and synthesize "long-tail" facts from dispersed sources, a capability that remains under-evaluated. We introduce PolitNuggets, a multilingual benchmark for agentic information synthesis via constructing political biographies for 400 global elites, covering over 10,000 political facts. We standardize evaluation with an optimized Supervisor–Searcher multi-agent system and propose FactNet, an evidence-conditional protocol that scores discovery, fine-grained accuracy, and efficiency. Across models and settings, we find that current systems often struggle with fine-grained details and vary substantially in efficiency. Finally, using benchmark diagnostics, we relate agent performance to underlying model capabilities, highlighting the importance of short-context extraction, multilingual robustness, and reliable tool use.
Yifei Zhu, The University of Hong Kong, frankyifei@connect.hku.hk
## 1 Introduction
Reasoning and synthesizing information within a given context is the defining capability of modern Large Reasoning Models (LRMs). The key framework can be called Reasoning *in* Context, where a model is *passively* provided a finite set of evidence and must extract or synthesize answers from it (Lewis et al., [2020](https://arxiv.org/html/2605.14002#bib.bib21); Guu et al., [2020](https://arxiv.org/html/2605.14002#bib.bib22)). The rapid growth of context windows has enabled strong performance on long-document tasks (Shaham et al., [2023](https://arxiv.org/html/2605.14002#bib.bib23); An et al., [2024](https://arxiv.org/html/2605.14002#bib.bib24); Zhang et al., [2024](https://arxiv.org/html/2605.14002#bib.bib25); Bai et al., [2025](https://arxiv.org/html/2605.14002#bib.bib27); Vodrahalli et al., [2024](https://arxiv.org/html/2605.14002#bib.bib28); Yang et al., [2025b](https://arxiv.org/html/2605.14002#bib.bib29); Yen et al., [2025](https://arxiv.org/html/2605.14002#bib.bib30)).
However, a new paradigm is emerging. By integrating LRMs into agentic frameworks equipped with retrieval tools, models can now actively explore, filter, and construct their own context from open-ended sources such as webpages and codebases (Nakano et al., [2021](https://arxiv.org/html/2605.14002#bib.bib17); Schick et al., [2023](https://arxiv.org/html/2605.14002#bib.bib19); Zhou et al., [2024](https://arxiv.org/html/2605.14002#bib.bib18)). This unlocks a different layer of complexity: Reasoning *through* Context. Unlike the passive in-context setting, here the agent must navigate a potentially unbounded information space, making sequential decisions on *what* to read, *when* to stop, and *how* to synthesize fragmented evidence into a coherent whole (Wei et al., [2025](https://arxiv.org/html/2605.14002#bib.bib15)).
Figure 1: Agent performance heatmap on an example biography (Erik Solheim), illustrating the "head" vs. "long-tail" synthesis gap.

While production systems like OpenAI Deep Research (OpenAI, [2025b](https://arxiv.org/html/2605.14002#bib.bib32)) and Perplexity Deep Research (Perplexity AI, [2025](https://arxiv.org/html/2605.14002#bib.bib33)) demonstrate the promise of this agentic paradigm, there remains a lack of rigorous benchmarks for Reasoning *through* Context under longitudinal synthesis demands. Many existing agentic evaluations emphasize short-horizon interactions or isolated fact retrieval (Yao et al., [2022](https://arxiv.org/html/2605.14002#bib.bib34); Mialon et al., [2024](https://arxiv.org/html/2605.14002#bib.bib13); Wei et al., [2025](https://arxiv.org/html/2605.14002#bib.bib15)), and therefore under-measure the professional workflow of reconstructing a coherent narrative from scattered, disconnected, and sometimes contradictory sources. Moreover, few evaluations link a model's Reasoning-*through*-Context performance to its Reasoning-*in*-Context ability.
To bridge this gap, we introduce PolitNuggets, a benchmark grounded in a high-impact and realistic task: the construction of political biographies. Wikipedia, while a triumph of collaborative human curation, exhibits systematic coverage gaps—particularly for non-US officials—and often lacks the fine-grained precision required for professional domains such as academic research or political consulting. PolitNuggets tests models' reasoning-through-context abilities by requiring them to discover long-tail biography "nuggets" from the open web. This evaluation demands long-horizon reasoning, multilingual understanding, and reliable tool use. Our benchmark also provides a static corpus for evaluating models' reasoning-in-context ability.
Our evaluation of models within an agentic framework reveals that, although agents maintain high precision, they consistently struggle with recall in open-ended settings. We also observe a substantial performance degradation for Non-US entities (up to a ∼40% relative drop in F1 in some settings), highlighting a pronounced International Evidence Gap and demonstrating that multilingual robustness is a prerequisite for realistic use. We further connect the reasoning-through-context ability with the reasoning-in-context ability. Interestingly, the results reveal a Long-Context Paradox: strong long-context reading (Reasoning *in* Context) does not reliably predict end-to-end agent performance (Reasoning *through* Context); rather, success is driven by short-context reading precision, reliable tool use, and multilingual understanding.
### 1.1 Traversing a latent fact network
We conceptualize political biography reconstruction not as single-shot retrieval, but as traversing a latent fact network. Let a target biography induce a directed graph $G=(V,E)$, where nodes $V$ are atomic "political nuggets" (e.g., *Minister of Defense, 2012–2015*) and edges $E$ are latent temporal/causal links expressed in unstructured text (e.g., *"After resigning in 2015, she joined the World Bank"*). The agent starts from a seed (entity name and minimal metadata) and must recover the relevant subset of $V$ by expanding along implicit edges discovered in documents.
This induces an optimization trilemma over correctness, coverage, and cost. Agents must maintain high precision (avoid unsupported events), high coverage (high recall over missing events in the long tail), and low cost (search steps/tokens). This framing explains why naive RAG is insufficient: the missing long-tail nodes may be weakly connected, requiring multi-hop query reformulation and evidence accumulation.
PolitNuggets evaluates whether agents can approximate the full latent fact network while retaining the efficiency of strategic traversal\. Strategic traversal jumps between salient nodes \(low cost, but vulnerable to missing weakly connected phases if reasoning fails\)\.
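The traversal described above can be sketched as a breadth-first expansion over a small nugget graph. The node names and edge list below are invented for illustration and are not drawn from the benchmark; each edge stands for a temporal/causal link an agent might discover in text (e.g., "After resigning in 2015, she joined the World Bank").

```python
from collections import deque

# Illustrative latent fact network G = (V, E): keys are nuggets (nodes in V),
# values are nuggets reachable via links mentioned in documents (edges in E).
# All names here are hypothetical examples, not benchmark data.
edges = {
    "Minister of Environment, 2007-2012": ["Minister of Defense, 2012-2015"],
    "Minister of Defense, 2012-2015": ["World Bank advisor, 2015-2018"],
    "World Bank advisor, 2015-2018": [],
}

def traverse(seed: str) -> list[str]:
    """Recover the subset of V reachable from the seed by expanding along
    discovered edges, mimicking an agent that follows textual links."""
    seen, order, frontier = {seed}, [], deque([seed])
    while frontier:
        node = frontier.popleft()
        order.append(node)            # node is "discovered" at this point
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

recovered = traverse("Minister of Environment, 2007-2012")
```

A strategic agent would not exhaust this frontier uniformly; it would prioritize salient nodes, which is cheaper but risks missing weakly connected career phases, exactly the trade-off discussed above.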
Figure 2: Language composition of retrieved evidence across countries. Bars show the share of retrieved tokens that are English vs. non-English; the right-side labels show the number of evaluated cases per country in our benchmark.
## 2 Benchmark & Task
The PolitNuggets benchmark evaluates agents on their ability to construct accurate, time\-resolved career histories for 400 political elites sourced from global government directories\.
### 2.1 Multilingual evidence in the wild
The evidence required to reconstruct political careers is inherently multilingual. Traversing a global biography is not merely a search problem: an agent must reason *through* multilingual context to decide what to read next, how to reformulate queries, and when a claim is sufficiently supported. To characterize the language composition of the documents an agent must consume, we compute, for each country, the fraction of retrieved evidence tokens that are in English versus non-English, based on the full set of pages and passages collected during the agentic experiments (Figure [2](https://arxiv.org/html/2605.14002#S1.F2)).
Our benchmark instances are drawn from the WhoGov dataset with a US and Non\-US sampling design\. We randomly sample 200 Non\-US cabinet politicians from WhoGov \(which provides names and basic metadata for over 58,000 global cabinet members from 1966 to 2023\), and we also randomly sample 200 US legislators and senators\. After preprocessing and filtering \(e\.g\., ID matching\), this yields the 400\-entity evaluation set reflected in Figure[2](https://arxiv.org/html/2605.14002#S1.F2)\.
### 2.2 Evaluation Levels: Event-Level vs. Attribute-Level
To disentangle an agent’s ability to*find*relevant evidence from its ability to*extract fine\-grained details*, we compute F1 at two levels of granularity, following standard slot\-filling terminology\.
1. Event-Level F1 (Discovery): Measures whether the agent correctly identifies the existence of a biographical event. A prediction is a true positive if the Role and Organization match the ground truth and the Year (Start/End) is correct. This primarily tests discovery (did the agent find the right nugget?).
2. Attribute-Level F1 (Granularity): Measures whether the agent can fill fine-grained attributes for an event (slot filling). A prediction matches only if the event-level criteria are met *and* the Start Month, End Month (within a 1-month tolerance), and Exact Official Title are correct. This primarily tests reading comprehension and schema compliance (did the agent read details correctly?).
The above slot structure \(Role/Organization/Date/Title\) applies to career and party events; other event types use type\-specific key fields \(e\.g\., relation and name for relatives, institution and degree for education\) with matching criteria adapted accordingly\. Cross\-lingual equivalence \(e\.g\., Norwegian titles vs\. English ground truth\) is delegated to the evidence\-conditional LLM judge rather than deterministic string normalization\.
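The two matching levels can be sketched as predicates over a simplified event schema. The field names and the month-tolerance helper below are assumptions for illustration; in particular, the paper delegates cross-lingual title equivalence to an LLM judge, which the plain string comparison here does not capture.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """Simplified career-event slot structure (field names are illustrative)."""
    role: str
    org: str
    start_year: int
    end_year: int
    start_month: Optional[int] = None
    end_month: Optional[int] = None
    title: Optional[str] = None

def event_level_match(pred: Event, gold: Event) -> bool:
    """Discovery: Role, Organization, and start/end Years must match."""
    return (pred.role == gold.role and pred.org == gold.org
            and pred.start_year == gold.start_year
            and pred.end_year == gold.end_year)

def _within_one_month(a: Optional[int], b: Optional[int]) -> bool:
    # 1-month tolerance; year wrap-around (Dec vs. Jan) ignored for brevity.
    return a is not None and b is not None and abs(a - b) <= 1

def attribute_level_match(pred: Event, gold: Event) -> bool:
    """Granularity: event-level criteria plus months (1-month tolerance)
    and the exact official title (stand-in for the LLM-judged equivalence)."""
    return (event_level_match(pred, gold)
            and _within_one_month(pred.start_month, gold.start_month)
            and _within_one_month(pred.end_month, gold.end_month)
            and pred.title == gold.title)
```

A prediction that passes `event_level_match` but fails `attribute_level_match` counts toward Event-Level F1 only, which is exactly the gap the two-level evaluation is designed to expose.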
### 2.3 Experiment design and conditions
Model selection. To assess the current frontier of agentic information synthesis, we select models that jointly satisfy three constraints required by PolitNuggets: (i) Reasoning *in* Context (strong synthesis from a static context window), (ii) Reasoning *through* Context (robust tool use and multi-turn planning), and (iii) affordability/efficiency (enabling evaluation at the scale of hundreds of entities). As practical proxies, we prioritize models that score highly on OpenAI's Multi-Round Contextual Reasoning (MRCR) benchmark for long-context reasoning (OpenAI, [2025c](https://arxiv.org/html/2605.14002#bib.bib35)) and on the Berkeley Function Calling Leaderboard (BFCL v3) for tool-use reliability (Patil et al., [2025](https://arxiv.org/html/2605.14002#bib.bib37)), while favoring "Flash/Fast" variants or efficient open-weight models over prohibitively expensive frontier offerings. This yields our evaluated set: Grok-4-Fast (xAI, [2025](https://arxiv.org/html/2605.14002#bib.bib38)), Gemini-2.5-Flash (Comanici et al., [2025](https://arxiv.org/html/2605.14002#bib.bib39)), and Qwen-3 (80B/225B) (Yang et al., [2025a](https://arxiv.org/html/2605.14002#bib.bib40)).
Task design. To disentangle *retrieval* capability from *discovery* capability, we evaluate models in two context conditions: with Wiki (Enhancement), where the agent is initialized with the target's existing Wikipedia text and must verify claims and fill missing gaps, and without Wiki (Reconstruction), where the agent starts from only the entity's name and must reconstruct the timeline from open-web sources (news archives, government gazettes) under a cold start.
## 3 Agentic System
### 3.1 Problem formalization
Let an entity $e$ have a (latent) biography represented as a set of time-stamped events $G_e=\{v_1,\dots,v_n\}$, where each $v_i=(r_i,o_i,t_i)$ denotes a Role $r_i$, an Organization $o_i$, and a time interval $t_i$ (e.g., start/end year or month). Let $W_e \subseteq G_e$ denote the subset covered by the entity's Wikipedia page (when present), and let $P_e$ be the set of events predicted by an agent after interacting with the open web.

The agent executes a sequence of search queries $q_{1:T}$ under a policy $\pi(q_t \mid h_t)$, where $h_t$ is the interaction history (retrieved snippets, intermediate notes, and partial timeline). Each query incurs a cost $c(q_t)$ (e.g., a search step and/or token usage), with a budget constraint $\sum_{t=1}^{T} c(q_t) \leq C$. The goal is to maximize coverage of missing biography events—i.e., high recall on $G_e \setminus W_e$—while remaining within budget:

$$\max_{\pi}\ \mathrm{E}\big[\mathrm{Recall}(P_e,\ G_e \setminus W_e)\big] \quad \text{s.t.} \quad \sum_{t=1}^{T} c(q_t) \leq C.$$
### 3.2 Architecture Details
We implement a standardized Supervisor–Searcher architecture with a clean tool interface to support long-horizon interaction while remaining operationally bounded (Figure [3](https://arxiv.org/html/2605.14002#S3.F3)).
1. Supervisor: Maintains global state via a running search summary and a to-do list. It decomposes the biography task into concrete search instructions for the Searcher and decides when to terminate the overall run (e.g., when marginal returns diminish or the step budget is reached).
2. Searcher: Executes search and browse/retrieve actions over unstructured web resources and returns targeted observations to the Supervisor. In addition to reporting observations, the Searcher can persist related chunks (source-linked evidence snippets) into an Archive. Persisting these records supports detailed downstream communication.
Finally, a specialized Coder agent maps the collected evidence into the strict JSON schema required for evaluation. In the final stage, we provide the Coder with both the Supervisor's report (summary + resolved to-do state) and the set of archived related chunks: the report provides global structure and resolved ambiguities, while the raw evidence supplies the fine-grained details needed for attribute filling.
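The Supervisor–Searcher–Coder control flow described above can be sketched as a minimal loop. The callables `supervise`, `search`, and `code_up` are illustrative stand-ins for the LLM-backed components, not the paper's actual interfaces.

```python
# Minimal control-flow sketch of the Supervisor-Searcher-Coder pipeline.
# All three callables are hypothetical stand-ins for LLM-backed components.

def run_agent(entity: str, budget: int, search, supervise, code_up):
    """supervise(history) -> next instruction or None (terminate);
    search(instruction) -> (observation, evidence_chunks);
    code_up(report, archive) -> structured biography."""
    history, archive = [], []          # running trace + persisted evidence
    for _ in range(budget):            # hard step budget keeps the run bounded
        instruction = supervise(history)
        if instruction is None:        # Supervisor judges marginal returns
            break                      # have diminished and terminates
        observation, chunks = search(instruction)
        history.append((instruction, observation))
        archive.extend(chunks)         # persist source-linked snippets
    report = {"entity": entity, "trace": history}
    return code_up(report, archive)    # Coder maps evidence to strict schema

# Usage with trivial stubs: two search rounds, then the Supervisor stops.
_instructions = iter(["find early career", "find ministry years", None])
bio = run_agent(
    "Erik Solheim", budget=10,
    supervise=lambda history: next(_instructions),
    search=lambda instr: (f"obs for {instr}", [f"chunk: {instr}"]),
    code_up=lambda report, archive: {"entity": report["entity"],
                                     "n_evidence": len(archive)},
)
```

Passing both the `report` and the raw `archive` to `code_up` mirrors the design choice above: the report supplies global structure while the archived chunks supply attribute-level detail.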
An ablation study shows that adding Archive-backed evidence persistence yields a substantial gain (equivalently, removing the Archive drops Event-Level performance by ΔF1 ≈ −0.05), supporting memory as a core design choice (Appendix [A.1.1](https://arxiv.org/html/2605.14002#A1.SS1.SSS1.Px5)).
##### Architecture vs. DeepResearch.
Empirically, our agentic architecture produces a recall-oriented operating point: the best-performing setting in our system (Grok-4-Fast) achieves higher Event-Level recall than Gemini DeepResearch (powered by Gemini 2.5 Pro) in the With-Wiki condition (US: 0.703 vs. 0.678; Non-US: 0.620 vs. 0.577), while Gemini DeepResearch is more precision-oriented (Event-Level precision, US: 0.912 vs. 0.890; Non-US: 0.892 vs. 0.872; Appendix Table [4](https://arxiv.org/html/2605.14002#A1.T4)).
Figure 3: The PolitNuggets Framework. (Top) Agentic system: Supervisor + Searcher (+ Archive) produces an Agentic Bio and the evidence corpora (Archive + retrieved pages). (Middle) Long-context LRM baselines: the Base LRM consumes these corpora to produce LRM bios (short-context from the Archive; long-context from raw pages). (Bottom) FactNet: evaluates the bios with a dynamic novelty ground truth by filtering Wikipedia-covered facts and validating candidate novel nuggets against archived evidence.
## 4 Evaluation Protocol
Standard exact-match metrics penalize agents for finding valid information that is absent from the ground truth (false positives). To address this, we employ the FactNet dynamic evaluation protocol. We report F1 at two levels of granularity: Event-Level F1 (discovery of the correct role/organization/year event) and Attribute-Level F1 (strict matching on fine-grained attributes such as start/end month and exact title, conditioned on a correct event).
### 4.1 Evaluation design
Let $G_e$ be the evidence-verified biography nuggets for entity $e$, and let $W_e \subseteq G_e$ denote the subset covered by the entity's Wikipedia page. We construct $G_e$—the Consolidated Ground Truth (CGT)—incrementally from pooled evidence across agentic runs rather than from manual enumeration or Wikipedia. An initial batch of runs produces the seed set; as subsequent runs surface new candidate nuggets, each is verified against the proposing run's archived evidence using the Judge LRM and, if supported, added to $G_e$. All systems are scored against the final snapshot. Wikipedia is used only to define $W_e$ (the coverage filter) and to support the With-Wiki condition. We validate CGT quality via manual timeline inspection (coverage) and via human–LLM judge consistency audits plus independent fact-checking with Exa ([https://exa.ai](https://exa.ai/)), an independent multilingual search backend used for audit (precision; Appendix [A.1.1](https://arxiv.org/html/2605.14002#A1.SS1.SSS1.Px4)).
Our primary target is the novel set $G = G_e \setminus W_e$ (i.e., facts not already available on Wikipedia at evaluation time). Let $P$ be the set of predicted nuggets produced by a system (agentic bio or LRM bio). We score predictions against a *dynamic novelty ground truth* $G'$, initialized as $G$ and expanded via the novelty validation below, to avoid penalizing supported discoveries missing from the curated set.
- Novelty Validation (Dynamic Novelty CGT): For any predicted nugget $p \in P$ such that $p \notin G$, we treat $p$ as a candidate novel nugget and trigger verification. An external Judge LRM (gpt-5-mini) checks whether $p$ is supported by the system's own evidence (source-linked passages in the Archive). If supported (and not Wikipedia-covered), $p$ is added to $G'$; otherwise it remains a false positive. This yields a Dynamic Novelty CGT that credits verifiable new discoveries while maintaining evidence-grounded precision.
- Judge reliability checks: We examined the consistency of this judge with human coders and an external search provider (Exa) via manual re-judging and independent fact checks (Appendix [A.1.1](https://arxiv.org/html/2605.14002#A1.SS1.SSS1.Px4)).
- F1 Score: Calculated on the dynamic set $G'$:
  $$\mathrm{Precision}=\frac{|P \cap G'|}{|P|}, \qquad \mathrm{Recall}=\frac{|P \cap G'|}{|G'|}, \qquad F_1=\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}.$$
- Efficiency Cost: Measured as Average Search Steps per Entity and Total Token Usage. This quantifies the "cognitive effort" required to achieve a given F1 score.
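The scoring step above can be sketched compactly, under the assumptions that nuggets are hashable items and that a callable `judge` stands in for the gpt-5-mini evidence check; `wiki_covered` marks Wikipedia-covered facts excluded from novelty credit.

```python
# Sketch of FactNet scoring with a dynamic novelty ground truth G'.
# `judge` is a hypothetical stand-in for the evidence-conditional LLM judge.

def factnet_f1(pred, gold_novel, judge, wiki_covered):
    dynamic = set(gold_novel)                      # G' initialized as G
    for p in pred:
        # Candidate novel nugget: not curated, not on Wikipedia, and
        # supported by the system's own archived evidence.
        if p not in dynamic and p not in wiki_covered and judge(p):
            dynamic.add(p)                         # credit verified discovery
    tp = len(set(pred) & dynamic)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(dynamic) if dynamic else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Usage: "b" is a supported novel discovery, "c" an unsupported false positive.
p, r, f1 = factnet_f1(["a", "b", "c"], {"a"},
                      judge=lambda x: x == "b", wiki_covered=set())
```

Note that unsupported predictions ("c" above) still count against precision, so the dynamic expansion rewards verifiable discoveries without forgiving hallucinations.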
### 4.2 Final evaluation
We evaluate two families of biographies produced from the same underlying evidence collection runs.
##### Agentic bios.
Our agentic system produces Agentic Bios in two context conditions: With-Wiki enhancement (4 models: Grok-4-Fast, Gemini-2.5-Flash, Qwen-3-225B, Qwen-3-80B) and Without-Wiki reconstruction (2 models: Grok-4-Fast, Gemini-2.5-Flash), yielding 6 agentic bio types in total.
##### Long-context LRM bios (baselines).
To quantify "Reasoning *in* Context" without agentic search, we ask each Base LRM to generate a biography directly from fixed evidence corpora produced by the Grok-4-Fast With-Wiki runs (the best-performing agentic setting), yielding 8 LRM bio types (4 models × 2 corpora): (i) a Short-context bio from the curated Archive (fine-grained, deduplicated evidence chunks; ∼30k tokens on average), and (ii) a Long-context bio from the concatenated Retrieved Web Pages (raw full documents from the same sessions; ∼300k tokens on average). This isolates improvements attributable to active planning, search, and evidence persistence (Reasoning *through* Context) versus what the Base LRM can achieve from a single static context window. Importantly, all LRMs are evaluated on the same fixed corpora.
## 5 Experimental Results
### 5.1 Main Performance Analysis
We analyze agent performance through the lens of a three-dimensional evaluation framework: Discovery (Event-Level F1), Granularity (Attribute-Level F1), and Efficiency (search steps/tokens). Table [1](https://arxiv.org/html/2605.14002#S5.T1) presents the comprehensive performance across all experimental conditions. Grok-4-Fast is the strongest model across both evaluation levels and both contexts, while also using fewer search steps. With Wiki, Grok-4-Fast achieves the best Event-Level F1 (US: 0.768; Non-US: 0.712) and the best Attribute-Level F1 (US: 0.501; Non-US: 0.475) at 11.1 steps on average; without Wiki it remains strong (Event-Level US/Non-US: 0.766/0.708; Attribute-Level US/Non-US: 0.506/0.475) with 14.5 steps. In contrast, Gemini exhibits comparable F1 in some settings but at substantially higher cost in cold-start reconstruction (18.1 steps), and the Qwen variants trail in both Event-Level discovery and Attribute-Level slot filling.
Table 1: Main results. Performance is reported as F1 at two evaluation levels: Event-Level (discovery of role/organization/year) and Attribute-Level (slot filling of month-level dates and exact titles).

| Context | Model | Region | Event F1 | Attr F1 |
|---|---|---|---|---|
| With wiki | Gemini DR | US | 0.778 | 0.505 |
| With wiki | Gemini DR | Non-US | 0.701 | 0.489 |
| With wiki | Gemini | US | 0.638 | 0.407 |
| With wiki | Gemini | Non-US | 0.679 | 0.485 |
| With wiki | Grok-4-Fast | US | 0.768 | 0.501 |
| With wiki | Grok-4-Fast | Non-US | 0.712 | 0.475 |
| With wiki | Qwen-225B | US | 0.499 | 0.335 |
| With wiki | Qwen-225B | Non-US | 0.440 | 0.306 |
| With wiki | Qwen-80B | US | 0.510 | 0.344 |
| With wiki | Qwen-80B | Non-US | 0.412 | 0.291 |
| Without wiki | Gemini | US | 0.671 | 0.439 |
| Without wiki | Gemini | Non-US | 0.618 | 0.468 |
| Without wiki | Grok-4-Fast | US | 0.766 | 0.506 |
| Without wiki | Grok-4-Fast | Non-US | 0.708 | 0.475 |
*Note:* The Without-Wiki condition is limited to Grok-4-Fast and Gemini-2.5-Flash because cold-start reconstruction retrieves substantially more documents, which can exceed Qwen's maximum context window (256k tokens), leading to unstable runs.
##### Finding 1: Discovery and granularity remain unsolved—and the gap is driven by recall, not precision.
Performance drops sharply when moving from Event-Level to Attribute-Level evaluation, and even Event-Level discovery is far from saturated. For example, Grok-4-Fast drops from 0.768 to 0.501 F1 (US), showing that extracting month-level dates and exact titles remains difficult. Decomposing performance reveals that this shortfall is primarily a recall/coverage problem rather than a precision problem: precision remains consistently high while recall is substantially lower and deteriorates further at the Attribute-Level (Appendix Table [4](https://arxiv.org/html/2605.14002#A1.T4)). In other words, agents are largely conservative—they tend to miss weakly connected long-tail events and attributes rather than fabricate unsupported ones. This precision–recall shape mirrors the passive vs. active split: models are often strong at Reasoning *in* Context once a relevant snippet is found, but fail at Reasoning *through* Context when they must autonomously discover that snippet in the first place (Section [6](https://arxiv.org/html/2605.14002#S6) and Table [3](https://arxiv.org/html/2605.14002#A1.T3)).
##### Finding 2: The International Evidence Gap.
We observe a consistent performance degradation for Non-US entities across most models. While Gemini maintains parity in Event-Level F1 with Wiki (0.638 US vs. 0.679 Non-US), Qwen-225B drops from 0.499 (US) to 0.440 (Non-US) at the Event-Level, and performance also degrades at the Attribute-Level across settings. This highlights a structural bias: lower availability of English-language sources and higher complexity in parsing non-US government archives significantly hamper agentic synthesis.
### 5.2 Efficiency Analysis: The Pareto Frontier
Figure 4: Efficiency Analysis: Search steps vs. F1 (top) and Token usage vs. F1 (bottom). Grok-4-Fast occupies the efficient frontier (top-left), achieving high F1 with minimal steps/tokens—a form of superior cognitive economy. Without Wikipedia context, Gemini maintains similar accuracy but requires substantially higher search volume (long dashed lines), relying on more search rather than improved reasoning efficiency.

To visualize the trade-off between performance and computational cost, we plot the Efficiency Pareto Frontier in Figure [4](https://arxiv.org/html/2605.14002#S5.F4). The dashed vertical and horizontal reference lines denote the average cost (steps/tokens) and the average F1, splitting the plane into four quadrants. The desirable "nice zone" is the top-left quadrant (above-average F1 at below-average cost), while the bottom-right quadrant corresponds to below-average F1 at above-average cost.
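The quadrant split can be written out directly: the reference lines sit at the average cost and average F1 of the plotted runs. The two example runs below reuse numbers quoted in the main-results discussion (Grok-4-Fast With-Wiki: 11.1 steps, 0.768 Event-Level F1 (US); Gemini cold-start: 18.1 steps, 0.671); treating them as the full run set is purely illustrative.

```python
# Classify runs relative to the average-cost / average-F1 reference lines
# that split the efficiency plane into four quadrants.

def quadrant(cost: float, f1: float, avg_cost: float, avg_f1: float) -> str:
    """Return the quadrant label for one run; top-left is the 'nice zone'."""
    vertical = "top" if f1 >= avg_f1 else "bottom"
    horizontal = "left" if cost <= avg_cost else "right"
    label = f"{vertical}-{horizontal}"
    return label + (" (nice zone)" if label == "top-left" else "")

# Illustrative run set: (search steps, Event-Level F1).
runs = {"grok-4-fast": (11.1, 0.768), "gemini (no wiki)": (18.1, 0.671)}
avg_cost = sum(c for c, _ in runs.values()) / len(runs)
avg_f1 = sum(f for _, f in runs.values()) / len(runs)
labels = {name: quadrant(c, f, avg_cost, avg_f1)
          for name, (c, f) in runs.items()}
```

With only two runs the averages necessarily separate them; in the actual figure the averages are taken over all evaluated settings.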
Figure 5: Model capability analysis (Event-Level). Each panel plots a normalized capability score (x-axis) against end-to-end Agent F1 (y-axis): (a) Short-Context Extraction, (b) Long-Context Recall, (c) Long–Short Gap, (d) Parametric Knowledge (closed-book), (e) Multilingual Robustness, (f) Tool-Use Reliability (BFCL). A positive trend indicates that the capability predicts agentic success.

Two robustness patterns emerge. First, removing Wikipedia context substantially increases cost (points shift rightward to higher steps/tokens), yet accuracy changes only modestly. We interpret this "Wiki removal" setting as a stress test of agentic robustness under prolonged trajectories: even as reasoning and search steps rise, F1 does not collapse, suggesting the framework can sustain longer-horizon interaction without severe degradation once it accumulates sufficient evidence. Second, Non-US settings tend to be less efficient than US settings, with many Non-US points shifting toward higher cost and/or lower F1, consistent with a multilingual evidence burden and more fragmented source ecosystems.
##### Finding 3: Wiki removal reveals an efficiency gap.
Figure [4](https://arxiv.org/html/2605.14002#S5.F4) shows that removing Wiki context consistently shifts points rightward (higher steps/tokens), moving runs out of the top-left "nice zone." Across models, we observe a clear cognitive-economy gap: Grok typically achieves comparable F1 with fewer steps, while Gemini more often substitutes search volume for reasoning precision (a "brute force" strategy). Taken together, these results suggest that the core remaining challenge is not simply "more thinking," but more efficient retrieval and search strategy: better query planning and evidence targeting are required to achieve high coverage without a large increase in cost. Statistical significance: the key efficiency gaps highlighted here (e.g., Wiki removal increasing steps/tokens for Gemini and Grok) have bootstrap 95% confidence intervals for mean deltas that exclude 0; see Appendix [A.2](https://arxiv.org/html/2605.14002#A1.SS2).
## 6 LRM Analysis: Bridging Passive Reasoning and Active Discovery
Having established the agentic performance benchmarks in Section [5](https://arxiv.org/html/2605.14002#S5.T1) and the controlled "Reasoning *in* Context" baselines in Section [4.2](https://arxiv.org/html/2605.14002#S4.SS2), we now investigate the fundamental drivers of success in longitudinal information synthesis. Using the short/long-context corpora defined in Section [4.2](https://arxiv.org/html/2605.14002#S4.SS2) and the FactNet protocol in Section [4.1](https://arxiv.org/html/2605.14002#S4.SS1), we decompose performance into six diagnostic dimensions: Short-Context Extraction, Long-Context Recall, the Long–Short Gap, Parametric Knowledge, Multilingual Robustness, and Tool-Use Reliability (BFCL). This analysis aims to bridge the gap between traditional "Reasoning *in* Context" benchmarks and the emerging "Reasoning *through* Context" paradigm. For completeness, we report the full LRM baseline results in Appendix [A.3](https://arxiv.org/html/2605.14002#A1.SS3) (Table [3](https://arxiv.org/html/2605.14002#A1.T3)).
### 6.1 The primacy of short-context extraction
We observe that a model's ability to extract facts from a clean, short context—the curated Archive—is strongly predictive of end-to-end agentic performance (Figure [5](https://arxiv.org/html/2605.14002#S5.F5)a). This suggests a "last-mile" bottleneck: if the model cannot reliably parse and structure high-quality evidence it has already found, additional searching cannot recover the loss, and end-to-end synthesis degrades primarily through missed events and attributes.
### 6.2 The decoupling of long-context recall
Crucially, we find that passive long-context recall on the noisy, full-document baseline is a weak predictor of agentic success. In particular, models that dominate end-to-end discovery do not necessarily outperform peers on static long-context extraction. This validates an *agentic hypothesis*: episodic search that iteratively curates small, high-quality contexts can outperform reliance on massive context windows alone, effectively bypassing the limitations of "Reasoning *in* Context" under noise.
A counterintuitive finding is that the Long–Short gap is not a reliable proxy for agentic success. Degradation from curated short contexts (the Archive) to noisy long contexts (full retrieved pages) does not consistently track end-to-end agent F1 (Figure [5](https://arxiv.org/html/2605.14002#S5.F5)b–c). One plausible explanation is a *training dilemma*: improving raw long-context capacity and minimizing long–short degradation may compete during training, so "better long-context" and "smaller long–short degradation" need not co-exist monotonically under realistic budget and model-structure constraints.
### 6.3 The multilingual reasoning barrier
The "International Evidence Gap" observed in our main results is structurally explained by a multilingual reasoning barrier (Figure [5](https://arxiv.org/html/2605.14002#S5.F5)e). Models that exhibit larger degradation when extracting from non-English evidence chunks also underperform on Non-US entities. This indicates that global longitudinal synthesis is bottlenecked not only by retrieving foreign-language documents, but also by reasoning over them with fidelity comparable to English evidence.
### 6.4 Parametric knowledge and tool reliability as scaffold
Finally, we observe a positive relationship between a model's closed-book (no-evidence) biography ability and its capacity to discover missing facts (Figure [5](https://arxiv.org/html/2605.14002#S5.F5)d,f). Parametric knowledge appears to act as a semantic scaffold: it supports entity disambiguation, improves query formulation, and helps the agent recognize valuable nuggets when encountered in the wild. Importantly, this semantic scaffold only helps end-to-end if the model can *reliably act on it*: tool-use reliability (BFCL) complements parametric knowledge by reducing brittle failures in search/browse execution, enabling consistent multi-step query reformulation, and translating high-level intent into stable tool calls.
## 7 Related Work
Evaluating reasoning *in* context: A rich line of long-context benchmarks has progressively raised the difficulty of passive evidence extraction. Early “needle-in-a-haystack” setups (e.g., HELMET (Yen et al., [2025](https://arxiv.org/html/2605.14002#bib.bib30))) probe whether models can locate a single target fact in a long context. Subsequent benchmarks extend this to multiple needles and multi-round contextual reasoning (MRCR (OpenAI, [2025c](https://arxiv.org/html/2605.14002#bib.bib35))), and further to structured reasoning over latent or explicit graphs (Michelangelo (Vodrahalli et al., [2024](https://arxiv.org/html/2605.14002#bib.bib28)), GraphWalks (OpenAI, [2025a](https://arxiv.org/html/2605.14002#bib.bib36))). Closest to our setting, LongBioBench (Yang et al., [2025b](https://arxiv.org/html/2605.14002#bib.bib29)) uses controlled synthetic biographies to examine long-context understanding, reasoning, and trustworthy generation. PolitNuggets advances beyond this lineage by shifting the locus of difficulty from reasoning *in* a given context to reasoning *through* context, where the agent must actively discover, filter, and synthesize evidence from the open web, and by grounding evaluation in real-world, multilingual political biographies rather than synthetic text.
Evaluating reasoning *through* context: Evaluation of tool-augmented reasoning spans static multi-hop QA datasets such as MuSiQue (Trivedi et al., [2022](https://arxiv.org/html/2605.14002#bib.bib14)) and general agentic benchmarks such as GAIA (Mialon et al., [2024](https://arxiv.org/html/2605.14002#bib.bib13)) that assess fundamental tool proficiency. More recently, benchmarks including WebSailor (Li et al., [2025](https://arxiv.org/html/2605.14002#bib.bib12)) and BrowseComp (Wei et al., [2025](https://arxiv.org/html/2605.14002#bib.bib15)) emphasize verifying retrieved information in open environments, often following a “hard-to-find, easy-to-verify” paradigm (e.g., identifying a specific paper from indirect cues) (Wei et al., [2025](https://arxiv.org/html/2605.14002#bib.bib15)). DeepResearch Bench (Du et al., [2025](https://arxiv.org/html/2605.14002#bib.bib11)) pushes toward realistic research workflows with expert-crafted tasks, but this style of evaluation can be expensive and remains sensitive to verification quality. PolitNuggets builds on this line while moving beyond isolated fact lookup to multi-faceted biography synthesis, and provides a scalable evaluation protocol for multilingual discovery, addressing a critical gap in global agentic information retrieval.
## 8 Conclusion
PolitNuggets provides a rigorous assessment of agentic information synthesis in the wild, targeting the gap between Reasoning *in* Context (fixed evidence) and Reasoning *through* Context (tool-driven discovery). We introduce a scalable benchmark of political biography construction for 400 global elites and evaluate systems with FactNet, an evidence-conditional protocol that measures discovery, fine-grained accuracy, and efficiency while validating candidate nuggets against retrieved evidence. Across models and settings, we find that precision is generally high but performance is recall- and efficiency-limited, with Wikipedia removal substantially increasing search cost, and we observe a pronounced International Evidence Gap on Non-US entities. Ablations show that evidence persistence improves end-to-end outcomes, and our diagnostics highlight a “Long-Context Paradox”: strong long-context reading does not reliably predict agentic success, which is instead driven by short-context extraction, multilingual robustness, and reliable tool use. We hope PolitNuggets and its released artifacts support reproducible progress on evaluating and improving real-world agentic systems.
## Acknowledgments
We thank Songpo Yang, Xiao Liu, Jiangnan Zhu, and Junyan Jiang for insightful discussions around this project\. We also thank the anonymous reviewers for their constructive feedback\.
## References
- C. An, S. Gong, M. Zhong, X. Zhao, M. Li, J. Zhang, L. Kong, and X. Qiu (2024) L-Eval: Instituting Standardized Evaluation for Long Context Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), Bangkok, Thailand, pp. 14388–14411. External Links: [Link](https://aclanthology.org/2024.acl-long.776/), [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.776). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1).
- Y. Bai, S. Tu, J. Zhang, H. Peng, X. Wang, X. Lv, S. Cao, J. Xu, L. Hou, Y. Dong, J. Tang, and J. Li (2025) LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-Context Multitasks. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vienna, Austria, pp. 3639–3664. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.acl-long.183), [Link](https://aclanthology.org/2025.acl-long.183/). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1).
- G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, et al. (2025) Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv preprint arXiv:2507.06261. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2507.06261), [Link](https://arxiv.org/abs/2507.06261). Cited by: [§2.3](https://arxiv.org/html/2605.14002#S2.SS3.p1.1).
- M. Du, B. Xu, C. Zhu, X. Wang, and Z. Mao (2025) DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents. arXiv preprint arXiv:2506.11763. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2506.11763), [Link](http://arxiv.org/abs/2506.11763). Cited by: [§7](https://arxiv.org/html/2605.14002#S7.p2.1).
- K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 119, pp. 3929–3938. External Links: [Link](https://proceedings.mlr.press/v119/guu20a.html). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1).
- P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1).
- K. Li, Z. Zhang, H. Yin, L. Zhang, L. Ou, J. Wu, W. Yin, B. Li, Z. Tao, X. Wang, W. Shen, J. Zhang, D. Zhang, X. Wu, Y. Jiang, M. Yan, P. Xie, F. Huang, and J. Zhou (2025) WebSailor: Navigating Super-Human Reasoning for Web Agent. arXiv preprint arXiv:2507.02592. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2507.02592), [Link](http://arxiv.org/abs/2507.02592). Cited by: [§7](https://arxiv.org/html/2605.14002#S7.p2.1).
- G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom (2024) GAIA: A Benchmark for General AI Assistants. In The Twelfth International Conference on Learning Representations (ICLR). Note: Accessed 2026-01-06. External Links: [Link](https://openreview.net/forum?id=fibxvahvs3). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p3.1), [§7](https://arxiv.org/html/2605.14002#S7.p2.1).
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman (2021) WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv preprint arXiv:2112.09332. Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p2.1).
- OpenAI (2025a) GraphWalks: A Multi-Hop Long-Context Graph Reasoning Benchmark. Note: [https://huggingface.co/datasets/openai/graphwalks](https://huggingface.co/datasets/openai/graphwalks). Released with GPT-4.1; results reported in the GPT-4.1 blog post. Cited by: [§7](https://arxiv.org/html/2605.14002#S7.p1.1).
- OpenAI (2025b) Introducing Deep Research. Note: [https://openai.com/index/introducing-deep-research/](https://openai.com/index/introducing-deep-research/). OpenAI release, published February 2, 2025. Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p3.1).
- OpenAI (2025c) OpenAI-MRCR (Multi-Round Coreference). Note: [https://huggingface.co/datasets/openai/mrcr](https://huggingface.co/datasets/openai/mrcr). OpenAI dataset/eval introduced in the GPT-4.1 release on April 14, 2025; results used in this paper are provided in final_eval/mrcr_score.csv. Cited by: [§2.3](https://arxiv.org/html/2605.14002#S2.SS3.p1.1), [§7](https://arxiv.org/html/2605.14002#S7.p1.1).
- S. G. Patil, H. Mao, F. Yan, C. C. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez (2025) The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models. In Forty-Second International Conference on Machine Learning. Note: OpenReview, accessed 2026-01-05. External Links: [Link](https://openreview.net/forum?id=2GmDdhBdDk). Cited by: [§2.3](https://arxiv.org/html/2605.14002#S2.SS3.p1.1).
- Perplexity AI (2025) Introducing Perplexity Deep Research. Note: [https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research](https://www.perplexity.ai/hub/blog/introducing-perplexity-deep-research). Perplexity AI Blog. Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p3.1).
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: Language Models Can Teach Themselves to Use Tools. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p2.1).
- U. Shaham, M. Ivgi, A. Efrat, J. Berant, and O. Levy (2023) ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2023. Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1).
- H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022) MuSiQue: Multihop Questions via Single-Hop Question Composition. Transactions of the Association for Computational Linguistics 10, pp. 539–554. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00475), [Link](https://aclanthology.org/2022.tacl-1.31/). Cited by: [§7](https://arxiv.org/html/2605.14002#S7.p2.1).
- K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, S. Jain, R. Shivanna, J. Hui, N. Dikkala, M. Kazemi, B. Fatemi, R. Anil, E. Dyer, S. Shakeri, R. Vij, H. Mehta, V. Ramasesh, Q. Le, E. Chi, Y. Lu, O. Firat, A. Lazaridou, J. Lespiau, N. Attaluri, and K. Olszewska (2024) Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries. arXiv preprint arXiv:2409.12640. Note: version 2. External Links: [Link](http://arxiv.org/abs/2409.12640), [Document](https://dx.doi.org/10.48550/arXiv.2409.12640). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1), [§7](https://arxiv.org/html/2605.14002#S7.p1.1).
- J. Wei, Z. Sun, S. Papay, S. McKinney, J. Han, I. Fulford, H. W. Chung, A. T. Passos, W. Fedus, and A. Glaese (2025) BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents. arXiv preprint arXiv:2504.12516. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2504.12516), [Link](http://arxiv.org/abs/2504.12516). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p2.1), [§1](https://arxiv.org/html/2605.14002#S1.p3.1), [§7](https://arxiv.org/html/2605.14002#S7.p2.1).
- xAI (2025) Grok 4 Fast Model Card. Note: [https://x.ai/model-card/grok-4-fast](https://x.ai/model-card/grok-4-fast). xAI Technical Report. Cited by: [§2.3](https://arxiv.org/html/2605.14002#S2.SS3.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, et al. (2025a) Qwen3 Technical Report. arXiv preprint arXiv:2505.09388. External Links: [Link](https://arxiv.org/abs/2505.09388), [Document](https://dx.doi.org/10.48550/arXiv.2505.09388). Cited by: [§2.3](https://arxiv.org/html/2605.14002#S2.SS3.p1.1).
- Y. Yang, Z. Huang, W. Zhu, Z. Qiu, F. Yuan, J. Z. Pan, and I. Titov (2025b) A Controllable Examination for Long-Context Language Models. In The Thirty-Ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track. External Links: [Link](https://openreview.net/forum?id=atjpGqjG73). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1), [§7](https://arxiv.org/html/2605.14002#S7.p1.1).
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022) WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents. In Advances in Neural Information Processing Systems (NeurIPS). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p3.1).
- H. Yen, T. Gao, M. Hou, K. Ding, D. Fleischer, P. Izsak, M. Wasserblat, and D. Chen (2025) HELMET: How to Evaluate Long-Context Models Effectively and Thoroughly. In The Thirteenth International Conference on Learning Representations. Note: OpenReview: https://openreview.net/forum?id=293V3bJbmE. External Links: [Link](https://proceedings.iclr.cc/paper_files/paper/2025/hash/f5332c8273d02729730a9c24dec2135e-Abstract-Conference.html). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1), [§7](https://arxiv.org/html/2605.14002#S7.p1.1).
- X. Zhang, Y. Chen, S. Hu, Z. Xu, J. Chen, M. K. Hao, X. Han, Z. L. Thai, S. Wang, Z. Liu, and M. Sun (2024) ∞Bench: Extending Long Context Evaluation Beyond 100K Tokens. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p1.1).
- S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, Y. Bisk, D. Fried, U. Alon, and G. Neubig (2024) WebArena: A Realistic Web Environment for Building Autonomous Agents. In The Twelfth International Conference on Learning Representations (ICLR). Cited by: [§1](https://arxiv.org/html/2605.14002#S1.p2.1).
## 9 Limitations
First, due to budget constraints and practical model selection, we do not evaluate the largest and most expensive frontier-scale models. Such models may reveal a clearer connection (or a different relationship) between Reasoning *in* Context and Reasoning *through* Context.
Second, although we provide cached results for reproducibility, benchmark outcomes may still shift over time due to changes in the underlying search engine and the evolving web \(ranking drift, content updates, and availability\)\.
Third, our static-context LRM baselines are constructed from evidence corpora produced by agentic collection runs. This yields a controlled comparison, but it may not fully represent long-context performance under independently collected evidence. We therefore cannot conclude from this study that Reasoning *through* Context is strictly better than Reasoning *in* Context.
## 10 Ethical Considerations
PolitNuggets is constructed from human\-related information available in the public domain \(e\.g\., Wikipedia, official government pages, and public news/biographical sources\)\. We adhere to fair\-use principles for research and release only cached materials necessary for replication\. We do not intentionally collect or disclose private or sensitive personal information beyond what is already publicly available, and we do not include any leaked/private datasets\.
## 11 Potential Risks
The agentic biography\-construction techniques evaluated here could in principle be repurposed to profile private individuals or non\-public figures; we therefore restrict our benchmark to public officials whose career information is a matter of public record\. Model\-generated biographies can also contain factual errors that, if redistributed uncritically, could harm the subjects or propagate misinformation; we mitigate this by scoring only against evidence\-verified nuggets and by releasing the cached source pages so that each claim can be audited\. Finally, the International Evidence Gap we document indicates that naive deployment of such systems may disproportionately under\-represent non\-US and non\-English political figures, and this limitation should be explicitly disclosed in any downstream use\.
## 12 AI Usage
We used AI tools to assist with \(i\) implementation and refactoring of the evaluation and analysis code, \(ii\) plotting and figure formatting, and \(iii\) proofreading and minor style edits of the manuscript\. All experimental design decisions, evaluations, and reported results were reviewed by the authors, and we validated code changes by re\-running the pipeline to reproduce tables and figures\.
## Appendix A Additional Analysis
### A.1 Experiment Details
#### A.1.1 Deliverables
To support replication, we release a code repository ([https://github.com/yifeifrank/poli_searcher](https://github.com/yifeifrank/poli_searcher)) containing the Supervisor–Searcher agentic pipeline, together with a data release ([https://huggingface.co/datasets/frankyifei/politnuggets](https://huggingface.co/datasets/frankyifei/politnuggets)) containing the following three artifacts:
1. Consolidated Ground Truth (CGT). The final pooled, evidence-verified biography nuggets for all 400 entities (including the Wikipedia-coverage filter W_e), which define the evaluation target G and the dynamic novelty set G′.
2. Cached webpages. The raw retrieved web pages collected during our agentic runs, fixing the search snapshot used for all reported numbers and enabling offline re-evaluation.
3. LRM evaluation package. A curated static-context dataset (Archive-style short contexts and long-context corpora derived from the cached pages) for evaluating long-context biography extraction without interactive search, enabling controlled comparison of “Reasoning *in* Context” across models.
All LRM-baseline and FactNet evaluation procedures are fully specified by the prompts in Appendix [A.5](https://arxiv.org/html/2605.14002#A1.SS5) together with the released artifacts above, allowing reassembly without a dedicated code release.
##### Infrastructure, tools, and cost accounting\.
We implemented the full Supervisor–Searcher pipeline in langgraph. For LLM inference, we used OpenRouter as the API provider and recorded token usage with OpenRouter’s standardized token accounting. For web search, we used the Serper API; for page retrieval, we used the crawling services from Jina and Exa. To maximize robustness at scale, we used multiple layers of retry/backoff to ensure successful search and retrieval. Across the full experimental campaign (including development and testing), we issued approximately 300k searches (about $300 in search cost) and retrieved pages at an additional cost of about $50 (including free-tier, in-limit usage). Overall, the total third-party API spend for the project (including LLM APIs, search, and retrieval) was approximately $3,750.
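The layered retry/backoff can be sketched as follows; the helper name, its parameters, and the flaky client are illustrative, not the pipeline's actual wrapper:

```python
import random
import time

def with_retries(fn, max_attempts=4, base_delay=0.5, max_jitter=0.1,
                 sleep=time.sleep):
    """Retry fn() with exponential backoff (illustrative sketch)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Exponential backoff plus jitter to avoid synchronized retries.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, max_jitter)
            sleep(delay)

# A flaky "search" that fails twice before succeeding:
calls = {"n": 0}
def flaky_search():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return ["result"]

print(with_retries(flaky_search, sleep=lambda _: None))  # → ['result']
```

In practice each layer (search, retrieval, LLM call) would wrap its own client with different budgets; the structure is the same.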
##### Budget controls and termination criteria\.
We enforced two complementary termination criteria to bound the system’s budget. First, the Supervisor maintains a to-do list and allocates a bounded amount of *dedicated research* per item: for each to-do item, the Searcher is allowed at most three focused search–retrieve attempts; if the item remains unresolved after three attempts, the Supervisor abandons that branch and proceeds to other items to avoid pathological loops. Second, we impose a hard-coded global cap of 100 LLM calls per run to bound worst-case cost and latency.
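A minimal sketch of these two caps follows; the class and method names are hypothetical, as the real logic lives inside the Supervisor node of the langgraph pipeline:

```python
class BudgetController:
    """Sketch of the two termination criteria (illustrative names)."""

    def __init__(self, per_item_attempts=3, max_llm_calls=100):
        self.per_item_attempts = per_item_attempts  # focused tries per to-do item
        self.max_llm_calls = max_llm_calls          # hard global cap per run
        self.llm_calls = 0
        self.failed = {}  # to-do item -> failed search-retrieve attempts

    def record_llm_call(self):
        self.llm_calls += 1

    def run_exhausted(self):
        # The global cap bounds worst-case cost and latency.
        return self.llm_calls >= self.max_llm_calls

    def may_retry(self, item):
        # At most `per_item_attempts` tries; then the branch is abandoned.
        return self.failed.get(item, 0) < self.per_item_attempts

    def record_failure(self, item):
        self.failed[item] = self.failed.get(item, 0) + 1


budget = BudgetController()
for _ in range(3):
    budget.record_failure("1985-1987 career gap")
print(budget.may_retry("1985-1987 career gap"))  # → False: branch abandoned
```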
##### Experimental timeline and operational exclusion\.
Our primary data collection was conducted in September 2025 (Gemini and Qwen families), and we added Grok-4-Fast in November 2025 given its strong tool-use reliability and direct relevance to real-world agentic deployments. We also attempted to evaluate GPT-5-Mini, but in preliminary runs it frequently failed to terminate within the maximum conversation-turn budget (T = 100), exhibiting repetitive search loops and unfinished trajectories; we therefore exclude it from the final comparison to preserve the integrity of the evaluation. However, given its strong long-context performance and affordability, we used it as the judge LRM in this article.
##### Consistency and validity of the LLM judge\.
To assess the reliability of our evidence-conditional LLM judge, we manually re-judged the final scores for 40 randomly selected officials and compared them against the LLM judge outputs. The human and LLM judgments are consistent, with a correlation of 0.87. As an additional validity check, we used Exa to further fact-check a random sample of 100 officials: out of 2,243 validated nuggets, we identified 82 false positives, corresponding to an inaccuracy rate of ≈ 3.66%. The manual re-judging was performed by four student annotators recruited through the authors’ university network. Each annotator was compensated at HKD 70 per hour, the prevailing rate for comparable student research assistance in Hong Kong. Annotators were informed in writing about the purpose of the task (validating LLM-generated biography assessments of public political figures) and the use of their labels (aggregate statistics only, no personal data collected). The instructions given were similar to the prompts used in the research.
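Both checks reduce to simple statistics; the sketch below is illustrative and assumes per-official scalar scores (the paper does not specify which correlation coefficient was used):

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Inaccuracy rate from the Exa spot check reported above:
false_positives, validated = 82, 2243
print(f"{false_positives / validated:.2%}")  # → 3.66%
```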
##### Architectural ablation: the necessity of memory\.
To validate the Archive memory mechanism added on top of the base Supervisor–Searcher loop, we conduct ablation studies on the Grok\-4 baseline \(Figure[6](https://arxiv.org/html/2605.14002#A1.F6)\)\. We compare the full system against a No\-Archive variant \(where evidence persistence is disabled and the Supervisor relies only on the Searcher’s summaries\) and a Report\-Only variant\.
Figure 6: Architectural Ablation (Coefplot). Removing the Archive (“non-archive”) significantly degrades performance (ΔF1 ≈ −0.05), confirming that raw evidence persistence is crucial for longitudinal synthesis.
The results are unequivocal. The coefficient for non-archive is significantly negative (β ≈ −0.05 at the Event-Level), validating our hypothesis: summaries are lossy. Without the Archive Tool to persist raw evidence chunks, the agent suffers from “Contextual Amnesia,” hallucinating connections or failing to link dependent facts.
### A.2 Statistical significance
To support the claims about setting and region gaps, we report distributional summaries and bootstrap confidence intervals for the underlying per-entity metrics. Table [2](https://arxiv.org/html/2605.14002#A1.T2) provides key contrasts (mean deltas with bootstrap 95% CIs; “CI excludes 0” indicates statistical significance at the 5% level).
In Table 2, each row corresponds to one *difference* computed over entities. `comparison` specifies the direction: (Non-US minus US) is E[Non-US] − E[US] within a fixed context/model; (Without wiki minus With wiki) is E[Without] − E[With] for the same model. `level` indicates whether the outcome is Discovery (event-level), Granularity (attribute-level), or Efficiency (cost). `metric` is the quantity being compared (`f1`, `steps`, or `tokens_total`). `n_us` and `n_non_us` are the effective sample sizes (completed entities) for the two groups, and `mean_us`/`mean_non_us` are the corresponding sample means. `delta` is the mean difference in the direction defined by `comparison`. `ci95_low` and `ci95_high` are the bootstrap 95% confidence interval bounds for `delta`, and `ci95_excludes_0` is True when the CI does not cross 0 (significant at α = 0.05). `cohen_d` reports a standardized effect size (sign follows `delta`). We use unpaired bootstrap resampling for Non-US minus US (different entity sets) and paired bootstrap for Without wiki minus With wiki when comparing the same entities across settings.
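The paired/unpaired percentile bootstrap described here can be sketched as follows; the function names, resampling count, and seed handling are assumptions for illustration, not the paper's exact implementation:

```python
import random

def bootstrap_ci(stat_fn, group_a, group_b, n_boot=2000, alpha=0.05,
                 paired=False, seed=0):
    """Percentile-bootstrap CI for a two-group statistic (sketch)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        if paired:
            # Resample entity indices jointly: same entities across settings
            # (the Without-wiki minus With-wiki contrast).
            idx = [rng.randrange(len(group_a)) for _ in range(len(group_a))]
            a = [group_a[i] for i in idx]
            b = [group_b[i] for i in idx]
        else:
            # Resample each group independently: different entity sets
            # (the Non-US minus US contrast).
            a = [rng.choice(group_a) for _ in group_a]
            b = [rng.choice(group_b) for _ in group_b]
        stats.append(stat_fn(a, b))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def delta_means(a, b):
    return sum(a) / len(a) - sum(b) / len(b)
```

A returned interval that excludes 0 corresponds to `ci95_excludes_0 = True` at α = 0.05.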
Table 2: Statistical evidence for key differences. We report mean deltas with bootstrap 95% confidence intervals (CIs) computed over entities; “CI excludes 0” indicates significance.
### A.3 Full experiment results
#### A.3.1 LRM baseline results
Table 3: LRM baseline results (Reasoning *in* Context). We report F1 for biographies generated directly from three static contexts: Short (Archive), Long (raw retrieved pages), and Memory (memory-only bio). Easy corresponds to EventF1 and Hard corresponds to AttrF1.
##### LRM baseline findings.
First, both Short and Long LRM bios underperform the best agentic setting (e.g., Grok-4-Fast With Wiki: EventF1 0.768 / AttrF1 0.501), despite operating over evidence collected from the same sessions. Second, Long-context performance is consistently worse than Short-context performance; for Grok-4-Fast (US, EventF1) the drop is (0.626 − 0.538)/0.626 × 100 ≈ 14.1%, reflecting degradation under long, noisy contexts. Third, the Memory-only baseline is uniformly low, suggesting that while memory leakage exists, it is not the deterministic driver of success in this task compared to evidence-grounded extraction.
#### A.3.2 Precision and recall breakdown
Table 4: Precision/recall/F1 breakdown (novel). We report Precision, Recall, and F1 for both Event-Level (discovery) and Attribute-Level (slot filling) evaluation, by context and region.
##### Breakdown.
Table [4](https://arxiv.org/html/2605.14002#A1.T4) confirms that the dominant gap is coverage rather than factuality: across models, EventPrec is consistently high while EventRec is substantially lower, and the drop is even sharper at the Attribute-Level (month/title matching). Notably, Gemini DeepResearch exhibits the highest precision in the With-wiki setting (EventPrec US/Non-US: 0.912/0.892) but lower recall than our best agentic model (EventRec US/Non-US: 0.678/0.577 vs. Grok-4-Fast 0.703/0.620), indicating a more conservative operating point. This precision–recall shape supports our framing that longitudinal synthesis failures are primarily due to missed weakly connected long-tail events, especially for Non-US entities.
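For concreteness, the sketch below shows how such event-level numbers combine from counts; the count-based formulas are the standard ones, while the matching step that produces the counts is FactNet's job as described in the paper:

```python
def prf1(n_candidates, n_gold, n_matched):
    """Precision/recall/F1 from candidate, gold, and matched-nugget counts."""
    precision = n_matched / n_candidates if n_candidates else 0.0
    recall = n_matched / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# A high-precision, lower-recall operating point, hypothetical counts:
p, r, f = prf1(n_candidates=20, n_gold=25, n_matched=18)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.9 0.72 0.8
```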
#### A.3.3 Model capability analysis (Attribute-Level)
Figure [7](https://arxiv.org/html/2605.14002#A1.F7) presents the diagnostic analysis for the Attribute-Level evaluation. The trends largely mirror the Event-Level setting, though the correlations are noisier due to the overall lower performance ceiling under strict month-level matching.
Figure 7: Model Capability Analysis (Attribute-Level). The same 2×3 diagnostic grid as Figure [5](https://arxiv.org/html/2605.14002#S5.F5), with panels (a)–(f), for the Attribute-Level evaluation.
### A.4 Case Study: Candidate vs. Ground Truth Biography Tables
We provide a qualitative case study for one Non-US political figure to illustrate how PolitNuggets evaluates both discovery and attribute-level extraction. Table LABEL:tab:case_study_candidates lists the agent-generated candidate biography entries and their evidence support status under the FactNet protocol, while Table LABEL:tab:case_study_cgt lists the corresponding ground-truth (CGT) biography entries and their match categories against the agent output.
#### A.4.1 Single-agent run history (Erik Solheim)
Following the Supervisor–Searcher design (Figure [3](https://arxiv.org/html/2605.14002#S3.F3)), Table LABEL:tab:case_study_history organizes one representative run into a structured workflow. The table makes explicit the Supervisor’s reasoning (instructions/goals) and the Searcher’s execution (queries, sources, and archived observations), including the tools used and the primary language of each step.
Table 5: The Agentic System: research history for Erik Solheim. The table highlights period/step structure, tool usage, query language, sources, and Archive updates (English glosses for Norwegian snippets in brackets).

**Phase 1 (msgs 0–7), Step 0: Global goal.**
- Supervisor instruction / goal: Create a comprehensive, evidence-based biography and timeline for Erik Solheim, focusing on career milestones, family details, and education history.
- Tool use: Search; Browse/Retrieve; Archive; Coder. Language: NO/EN.
- Searcher actions: Initialize a to-do list and begin evidence collection from high-precision sources.
- Archive update: start with a seed biography skeleton and gaps list to drive subsequent targeted search.

**Phase 1 (msgs 0–7), Step 1.1: Initial skeleton.**
- Supervisor instruction / goal: Perform a comprehensive initial sweep: gather basic biographical details (birth year/place/gender), party affiliations, and major career milestones. Prioritize official Norwegian government sources and Wikipedia.
- Tool use: Search; Browse/Retrieve; Archive. Language: NO/EN.
- Key queries: `Erik Solheim AND (biografi OR miljøminister OR SV OR født)`; target sites: no.wikipedia.org, regjeringen.no, stortinget.no.
- Evidence: *no.wikipedia.org*: “Erik Solheim (født 1955)…Sjef for FNs miljøprogram…SV…MDG (2019–)…” [born 1955; Head of the UN Environment Programme; SV; MDG (2019–)]; *regjeringen.no (CV)*: “…unique in holding the combined portfolio of Minister of the Environment…1987–1997 Leader for Sosialistisk Venstreparti…”; *stortinget.no*: “Solheim, Erik (1955-)…Foreldre: Bjørn Elling Solheim og Sophie Grindstad…Cand.mag. 1980.” [parents: Bjørn Elling Solheim and Sophie Grindstad; Cand.mag. degree 1980].
- Archive update: identity confirmed (b. 1955); baseline timeline built (SV leadership 1987–97; ministerial roles 2005–12; UNEP 2016–18).
- Missing: spouse names, specific early education (primary schools), and details on the 1985–1987 career gap.

**Phase 2 (msgs 8–20), Step 2.1: Gap filling (family & mid-career).**
- Supervisor instruction / goal: Target Phase 2 expansion: find precise details on early education, spouse name, career gaps (specifically 1985–1987 at Norges Handikapforbund), and post-2018 positions.
- Tool use: Search; Browse/Retrieve; Archive. Language: NO/EN.
- Key queries: `Erik Solheim AND (kone OR Sissel Bugge OR Gry Ulverud)`; `Erik Solheim AND (Norges Handikapforbund OR konsulent)`; `Erik Solheim AND (post-2018 OR MDG OR Green Belt)`.
- Evidence: *snl.no*: “Gift 1) 1989 med Sissel Bugge…Gift 2) 2004 med Gry Høeg Ulverud…Konsulent i Norges Handikapforbund 1985–1987.” [married (1) Sissel Bugge in 1989 and (2) Gry Høeg Ulverud in 2004; consultant at Norges Handikapforbund 1985–1987]; *linkedin.com*: “…President Europe-Asia Center (Jul 2023-)…Champion Global Renewables Alliance (Nov 2024-)…”.
- Archive update: validated marriages and children; confirmed “Consultant” role at Handikapforbund (1985–87); mapped extensive post-2018 advisory roles (Green Belt, WRI, etc.).
- Missing: still lacks specific names of primary and lower secondary schools (only Oslo katedralskole is known).

**Phase 3 (msgs 21–40), Step 3.1: Deep dive (early life & education).**
- Supervisor instruction / goal: Identify primary and secondary schools attended before Oslo katedralskole; check memoirs or interviews for early-life details in the Bislett area.
- Tool use: Search; Browse/Retrieve; Archive. Language: NO.
- Key queries: `Erik Solheim AND (skole OR Bolteløkka OR Ila OR Bislett)`; `Erik Solheim AND (oppvekst OR barndom OR tidlig liv)`.
- Evidence: *vg.no (Factbox)*: “…Oppvokst på Bislett, gikk på Bolteløkka skole, Ila skole og Oslo katedralskole.” [grew up in Bislett; attended Bolteløkka skole, Ila skole, and Oslo katedralskole]; *nettavisen.no*: “…Vokst opp: Bislett…Bor nå: St. Hanshaugen.” [grew up: Bislett; now lives: St. Hanshaugen].
- Archive update: success (“needle in the haystack”); new data: primary school *Bolteløkka skole* and lower secondary *Ila skole*.
- Result: timeline is now continuous from childhood through 2025.

**Final output, Step 4.**
- Supervisor instruction / goal: Compile the archived snippets into the final structured dataset for evaluation.
- Tool use: Coder. Language: N/A.
- Searcher actions: generate the final Agentic Bio (JSON) from the accumulated Archive.
- Output: the Coder agent compiles the archived snippets into a structured JSON biography.

##### Run summary.
The run begins with a global goal to construct an evidence-grounded biography and to identify gaps. In Phase 1, the Searcher uses Norwegian and English queries to prioritize official sources and Wikipedia, confirming identity (b. 1955) and establishing the core timeline (SV leadership 1987–97; ministerial roles 2005–12; UNEP 2016–18) while leaving spouse names, early schooling, and a mid-career gap unresolved. In Phase 2, targeted queries surface family details and the 1985–1987 Norges Handikapforbund role from *snl.no*, and post-2018 roles from *linkedin.com*, narrowing the remaining gap to early-life schools. In Phase 3, Norwegian queries focused on Bislett uncover the missing primary and lower-secondary schools (Bolteløkka skole; Ila skole) from *vg.no* and corroborating profile details from *nettavisen.no*, completing a continuous education timeline. Finally, the Coder compiles archived evidence into a structured JSON output for evaluation (shown below, shortened for brevity):
```
{
"codebook_results": {
"full_name": "Erik Solheim",
"birth_date": "1955.01.18",
"education_experiences": [
{
"organization_name": "Boltelkka skole",
"education_level": "Primary school",
"notes": "Bislett area Oslo"
},
{
"organization_name": "Ila skole",
"education_level": "Lower secondary"
},
{
"organization_name": "Oslo katedralskole",
"education_level": "High school",
"notes": "Examen artium 1973/74"
},
{
"organization_name": "Universitetet i Oslo (UiO)",
"education_level": "Master",
"notes": "Cand.mag. 1980"
}
],
"occupation_experiences": [
{
"time_range": "2007-2012",
"position_title": "Minister of the Environment"
},
{
"time_range": "2016-2018",
"position_title": "Executive Director UNEP"
}
]
}
}
```
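Downstream evaluation compares such a biography entry-by-entry against ground truth, which requires flattening the nested JSON into one line per fact. A minimal sketch of that step, assuming the field names shown in the example above (`codebook_results`, `education_experiences`, `occupation_experiences`); the real pipeline's schema and line format may differ:

```python
import json

# Required keys per education entry, per the example JSON above.
REQUIRED_EDU_KEYS = {"organization_name", "education_level"}

def flatten_bio(raw: str) -> list[str]:
    """Flatten a codebook-style biography JSON into one line per entry,
    the shape an entrywise evaluator can consume."""
    bio = json.loads(raw)["codebook_results"]
    lines = [f"name | {bio['full_name']}", f"birth | {bio['birth_date']}"]
    for edu in bio.get("education_experiences", []):
        missing = REQUIRED_EDU_KEYS - edu.keys()
        if missing:
            raise ValueError(f"education entry missing {missing}")
        lines.append(f"education | {edu['organization_name']} | {edu['education_level']}")
    for job in bio.get("occupation_experiences", []):
        lines.append(f"occupation | {job['time_range']} | {job['position_title']}")
    return lines

example = """{"codebook_results": {"full_name": "Erik Solheim",
  "birth_date": "1955.01.18",
  "education_experiences": [{"organization_name": "Oslo katedralskole",
                             "education_level": "High school"}],
  "occupation_experiences": [{"time_range": "2016-2018",
                              "position_title": "Executive Director UNEP"}]}}"""
print(flatten_bio(example))
```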
Table 6: Case study (Grok candidates): candidate biography entries and evidence support category.

| Type | Candidate Entry | Support Category |
|------|-----------------|------------------|
| Education | 1961.01–1969.12 \| Bolteløkka skole \| Primary school | FULLY_SUPPORTED |
| Education | 1969.01–1972.12 \| Ila skole \| Lower secondary | FULLY_SUPPORTED |
| Education | 1970.01–1974.12 \| Oslo katedralskole \| High school | FULLY_SUPPORTED |
| Education | 1974.01–1980.12 \| Universitetet i Oslo (UiO) \| Master | FULLY_SUPPORTED |
| Education | 1974.01–1980.12 \| Universitetet i Oslo \| Student | FULLY_SUPPORTED |
| Party | 1977.01–1997.05 \| Sosialistisk Venstreparti (SV) \| Member/Leader | FULLY_SUPPORTED |
| Party | 1977.01–1980.12 \| Sosialistisk Ungdom (SU) \| Leader | FULLY_SUPPORTED |
| Party | 1981.01–1985.12 \| Sosialistisk Venstreparti (SV) \| Partisekretær | FULLY_SUPPORTED |
| Party | 1987.01–1997.05 \| Sosialistisk Venstreparti (SV) \| Party Leader | FULLY_SUPPORTED |
| Party | 2019.01–Present \| Miljøpartiet De Grønne (MDG) \| Member/Advisor | FULLY_SUPPORTED |
| Party | 2019.01–Present \| Miljøpartiet De Grønne (MDG) \| Advisor | FULLY_SUPPORTED |
| Career | 1985.01–1987.12 \| Norges Handikapforbund \| Konsulent | FULLY_SUPPORTED |
| Career | 1989.10–1993.09 \| Stortinget \| Stortingsrepresentant Sør-Trøndelag | FULLY_SUPPORTED |
| Career | 1993.10–2001.09 \| Stortinget \| Stortingsrepresentant Oslo | FULLY_SUPPORTED |
| Career | 2000.03–2005.12 \| Utenriksdepartementet (UD) \| Spesialrådgiver | FULLY_SUPPORTED |
| Career | 2005.10–2007.10 \| Utenriksdepartementet (UD) \| Utviklingsminister | FULLY_SUPPORTED |
| Career | 2007.10–2012.03 \| Miljøverndepartementet / Utenriksdepartementet \| Miljøvernminister + Utviklingsminister | FULLY_SUPPORTED |
| Career | 2013.01–2016.12 \| OECD \| Chair Development Assistance Committee (DAC) | FULLY_SUPPORTED |
| Career | 2016.01–2018.11 \| UN Environment Programme (UNEP) \| Executive Director | FULLY_SUPPORTED |
| Career | 2017.01–Present \| BRIGC / Green Belt and Road Institute \| President/Convener | FULLY_SUPPORTED |
| Career | 2019.01–2023.12 \| APRIL / World Resources Institute (WRI) / TREELION \| Environment/Senior Advisor / Co-Chair | FULLY_SUPPORTED |
| Career | 2021.01–2023.12 \| Aker Horizons / Morrow Batteries \| Industrial/Environment Advisor | PARTIALLY_SUPPORTED |
| Career | 2023.07–Present \| Europe-Asia Center \| President | FULLY_SUPPORTED |
| Career | 2024.11–Present \| Global Renewables Alliance \| Champion | FULLY_SUPPORTED |
| Relatives | father \| Bjørn Elling Solheim | FULLY_SUPPORTED |
| Relatives | mother \| Sophie Grindstad | FULLY_SUPPORTED |
| Relatives | ex-spouse \| Sissel Bugge | FULLY_SUPPORTED |
| Relatives | spouse \| Gry Høeg Ulverud | FULLY_SUPPORTED |
| Relatives | child \| Øyvind Solheim | FULLY_SUPPORTED |
| Relatives | child \| Mari Solheim | FULLY_SUPPORTED |
| Relatives | child \| Aksel Solheim | FULLY_SUPPORTED |
| Relatives | child \| Sofie Solheim | FULLY_SUPPORTED |
| Relatives | sibling \| NA | FULLY_SUPPORTED |

Table 7: Case study (ground truth / CGT): ground-truth biography entries and match category against Grok output.

| Type | CGT Entry | Match Category |
|------|-----------|----------------|
| Education | 1961.01–1969.12 \| Bolteløkka skole \| Primary school | FULL_MATCH |
| Education | 1969.01–1972.12 \| Ila skole \| Lower secondary | FULL_MATCH |
| Education | NA–1974.01 \| Oslo Cathedral School \| High school | FULL_MATCH |
| Education | 1975.01–1980.01 \| University of Oslo \| cand.mag. degree | FULL_MATCH |
| Party | 1977.01–1981.01 \| Socialist Youth \| Leader | FULL_MATCH |
| Party | 1981.01–1985.01 \| Socialist Left Party \| Party Secretary | FULL_MATCH |
| Party | 1985.01–1987.12 \| Socialist Left Party \| Member of the Central Executive Committee | NO_MATCH |
| Party | 1987.04–1997.05 \| Socialist Left Party \| Party Leader | PARTIAL_MATCH |
| Party | 1989.10–2019.01 \| Socialist Left Party \| Member | PARTIAL_MATCH |
| Party | 2019.01–Present \| Green Party \| Member | FULL_MATCH |
| Career | 1974.01–1975.01 \| Norwegian Air Force \| Conscript | NO_MATCH |
| Career | 1985.01–1987.12 \| Norges Handikapforbund \| Consultant | FULL_MATCH |
| Career | 1989.10–2001.09 \| Parliament of Norway \| Member of Parliament | FULL_MATCH |
| Career | 2000.03–2005.10 \| Ministry of Foreign Affairs \| Special Adviser | FULL_MATCH |
| Career | 2005.10–2012.03 \| Government of Norway \| Minister of International Development | FULL_MATCH |
| Career | 2007.10–2012.03 \| Government of Norway \| Minister of the Environment | FULL_MATCH |
| Career | 2012.03–2013.01 \| Ministry of Foreign Affairs \| Special Adviser | NO_MATCH |
| Career | 2013.01–2016.06 \| OECD \| Chair of Development Assistance Committee | FULL_MATCH |
| Career | 2016.06–2018.11 \| United Nations Environment Programme \| Executive Director | PARTIAL_MATCH |
| Career | 2018.11–Present \| Belt and Road Green Development Coalition \| Vice President | PARTIAL_MATCH |
| Career | 2018.11–Present \| Climate Council of Chief Minister MK Stalin \| Member | NO_MATCH |
| Career | 2018.11–Present \| Global Solar Council \| Global Ambassador | NO_MATCH |
| Career | 2018.11–Present \| Global Wind Energy Council \| Adviser | NO_MATCH |
| Career | 2018.11–Present \| Green Hydrogen Organization \| Chairman | PARTIAL_MATCH |
| Career | 2018.11–Present \| International Hydropower Association \| Board Member | NO_MATCH |
| Career | 2019–Present \| Green Belt and Road Institute \| President | FULL_MATCH |
| Career | 2019–Present \| World Resources Institute \| Senior Adviser | FULL_MATCH |
| Career | 2019.05–Present \| Plastic REVolution Foundation \| CEO | NO_MATCH |
| Relatives | father \| Bjørn Elling Solheim | FULL_MATCH |
| Relatives | mother \| Sophie Grindstad | FULL_MATCH |
| Relatives | former spouse \| Sissel Bugge | FULL_MATCH |
| Relatives | spouse \| Gry Ulverud | FULL_MATCH |
| Relatives | child \| Aksel Solheim | FULL_MATCH |
| Relatives | child \| Mari Solheim | FULL_MATCH |
| Relatives | child \| Sofie Solheim | FULL_MATCH |
| Relatives | child \| Øyvind Solheim | FULL_MATCH |
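As a rough sanity check, the per-entry categories in Tables 6 and 7 can be aggregated into headline rates (candidate support, CGT coverage). The half-credit weighting for partial categories below is our illustrative choice, not necessarily the paper's FactNet definition:

```python
from collections import Counter

def rate(categories: list[str], full: str, partial: str) -> float:
    """Fraction of entries fully matched, with half credit for partials."""
    counts = Counter(categories)
    n = len(categories)
    return (counts[full] + 0.5 * counts[partial]) / n if n else 0.0

# Category counts read off Tables 6 and 7 above.
candidate = ["FULLY_SUPPORTED"] * 32 + ["PARTIALLY_SUPPORTED"]          # Table 6
cgt = ["FULL_MATCH"] * 23 + ["PARTIAL_MATCH"] * 5 + ["NO_MATCH"] * 8    # Table 7

support = rate(candidate, "FULLY_SUPPORTED", "PARTIALLY_SUPPORTED")
coverage = rate(cgt, "FULL_MATCH", "PARTIAL_MATCH")
print(f"support={support:.2f} coverage={coverage:.2f}")  # -> support=0.98 coverage=0.71
```

The asymmetry is visible in the case study itself: the candidate is almost entirely grounded, but several ground-truth facts (conscription, later advisory roles) go undiscovered.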
### A.5 Prompts used in the research
We list and release the exact prompt templates used in our pipeline, grouped by stage.

Table 8: Prompt templates used in the research.

| Stage | Prompt |
|-------|--------|
| Architecture | Supervisor prompt |
| Architecture | Searcher prompt (Archive on) |
| Experiment | Query template (EN) |
| Experiment | Research plan template (EN) |
| Evaluation | Fact-checking (related-content judge) prompt |
| Evaluation | Entrywise evaluation prompt |

#### A.5.1 Architecture prompts
##### Supervisor prompt\.
```
You are the Supervisor for a multi-step deep web research agent.
You reason based on the structured state:
- Research request (user query, constraints, codebook)
- Search batch history (each batch_overview with supervisor_task_instruction, research_summary, detailed_analysis)
- todo_list (remaining search gaps with [k] counters)
- global_summary (running summary of findings so far)
Each turn you must:
1) Update `global_summary` so it is a readable, self-contained summary of all solid facts found so far.
2) Update `todo_list` so it reflects the remaining important gaps.
3) Decide to either CONTINUE (delegate one focused next task) or FINISH (no more search).
OUTPUT FORMAT (JSON ONLY, no extra text, no markdown fences):
{
  "todo_list": "...",
  "next_task_instruction": "... or null",
  "global_summary": "..."
}
Field rules:
- `global_summary`:
  - Treat as the single evolving research summary.
  - Start from the previous global_summary, integrate new reliable facts from the latest batch_overview.
  - Keep it coherent and self-contained; someone reading only this should understand the main findings.
- `todo_list`:
  - Text block listing remaining gaps, typically as lines like `[k] <gap description>` (plus optional headings).
  - When a gap is fully answered, remove it.
  - When partially answered, rewrite to express only what is still missing.
  - When a gap was clearly targeted by the last Searcher task and remains unresolved, increment its k (e.g. `[1]` -> `[2]` -> `[3]`).
  - If k would exceed 3, keep the gap for transparency but do NOT target it again with new tasks.
- `next_task_instruction`:
  - Non-empty string => CONTINUE mode.
  - null => FINISH mode.
  - Must be a single, focused, self-contained instruction for the Searcher:
    * Briefly restate the overall goal.
    * Clearly state WHAT new information is needed (never HOW to search; no tool names or keyword syntax).
CONTINUE mode (non-empty `next_task_instruction`):
- Use when there are still important gaps in todo_list that are plausibly answerable by web research (prefer k=1 or 2).
- Decompose broad gaps into concrete questions when possible (e.g. "exact dates for role X" instead of "complete career history").
- Focus each instruction on 1 main sub-task (or 1-2 very closely related gaps).
FINISH mode (`next_task_instruction` = null):
- Use when remaining gaps are minor, low-value, or have high counters (>3), or the user's request is sufficiently answered.
- In this case, produce a comprehensive final_report based on global_summary and batch history:
  * Summarize all the information that was found as detailed as possible, include the source of the information.
  * Note any major remaining uncertainties or unsolved gaps.
  * Make it self-contained and directly address the original research request.
Today is {current_date}.
```
##### Searcher prompt \(Archive on\)\.
```
You are a professional Search Agent executing a research task to search, browse, and retrieve as broad relevant information as possible. You are capable of creatively and strategically designing keywords to search for related and diverse information. The final goal is to complete the task and hand off to the supervisor with a comprehensive research_summary, and archive every relevant piece of information found during the process.

### Understand the Task
- You receive a **self-contained task instruction** from the Supervisor that includes:
  - The overall research goal
  - A summary of what has been found so far
  - The specific objective for this search batch
  - Any relevant constraints
- Read the provided `current_task_instruction` carefully
  - The instruction should contain all context you need (goal, prior findings, current objective)
  - Focus on the **specific objective** stated in the instruction

## Your Core Action Loop
You search, retrieve, and archive to complete the task:
1. Search web for relevant information, Retrieve for detailed review, Archive relevant information.
2. Hand off to the supervisor if collected enough information.

### Execute Search
- Call `web_search(search_intent=...)` with a structured search plan
  - `any_of` means at least one of the terms in the list should appear in results.
  - `must_include` means all of the terms in the list must appear in results.
  - `must_not_include` means none of the terms in the list may appear in results.
- Start broad, then narrow based on results
  - Adjust the terms in `must_include` and `any_of` to make the search more specific or more broad based on observed results.
  - Avoid overly restrictive `must_include` terms
  - Mention generic meta-words like biography, bio, profile in `any_of` instead of `must_include`
  - Only use site restrictions when REALLY necessary
  - Flexibly use keywords in different languages as appropriate
- You have *{max_search_attempts}* search attempts, use wisely.

### Retrieve URLs Content for Browsing
- After each `web_search` call, call `retrieve_documents(urls=[...])` for the **promising** URLs from the latest results.
- Select up to 10 promising URLs per retrieve call.
- Skip retrieving if no results appear relevant.

### Archive Relevant Documents
- Archived information will be reviewed by the supervisor for reference
- For each relevant document found during browsing, call `archive_document(detailed_analysis=[...])`:
  - `url`: Document URL
  - `title`: Document title
  - `task_summary`: Summary of how this document addresses the task
  - `relevant_chunk_labels`: List of chunk labels for relevant paragraphs (e.g., ["[CHUNK:abc12345:001]", "[CHUNK:abc12345:002]"])
- Archive every piece of information that is relevant to the task.
- Should have archived all relevant documents by the time you hand off.

## Handoff to Supervisor
When the task is complete, call `handoff_to_supervisor_with_overview`:
- `research_summary`: Comprehensive narrative including:
  - **What was found**: Specific information with concrete details
  - **What is lacking**: Information not found or uncertain
- `search_intent_summary`: Feedback on search effectiveness:
  - `bad_must_include`: Terms that performed poorly
  - `good_any_of`: Terms that worked well
  - `search_languages`: Languages used in searches

## Tools (USE ONLY THESE)
- web_search(search_intent: object) - execute search
- retrieve_documents(urls: list[string]) - fetch and chunk document content from URLs
- archive_document(detailed_analysis: list[object]) - archive every relevant chunk found during browsing to storage for future reference
- handoff_to_supervisor_with_overview(research_summary: string, search_intent_summary: object) - final handoff

## Important:
# Maintain loops of search, retrieve, and archive to complete the task incrementally.
# Handoff when the task is complete.
# Reflect and reason with the context; with each tool call, affix a brief reflection paragraph.

## Context
Today is {current_date}.
```
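The Searcher prompt fully specifies the semantics of the structured `search_intent` (`any_of`, `must_include`, `must_not_include`), though not the wire format the `web_search` tool uses. As an assumption for illustration, one plausible translation into a boolean query string:

```python
# Hypothetical serialization of a search_intent object; the real tool's
# query syntax is not given in the paper.
def build_query(intent: dict) -> str:
    parts = []
    for term in intent.get("must_include", []):
        parts.append(f'"{term}"')                      # all must appear
    any_of = intent.get("any_of", [])
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")  # at least one must appear
    for term in intent.get("must_not_include", []):
        parts.append(f'-"{term}"')                     # none may appear
    return " ".join(parts)

# The Phase 3 Norwegian query from the case study, restated as an intent.
intent = {
    "must_include": ["Erik Solheim"],
    "any_of": ["skole", "Bolteløkka", "Ila", "Bislett"],
    "must_not_include": ["football"],
}
print(build_query(intent))
# -> "Erik Solheim" (skole OR Bolteløkka OR Ila OR Bislett) -"football"
```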
#### A\.5\.2Experiment prompts
##### Query template \(EN\)\.
```
Find comprehensive public information about {current_name}, a political or public figure{country_clause}{occupation_clause}{year_clause}.

REQUIRED INFORMATION:
- Basic biographical details: birth year, place of birth (province/state, city/county), gender
- Party affiliation history with year ranges, if applicable
  - For each party affiliation: year range, party name, position title (if any)
- Education history (primary, secondary, tertiary, and post-secondary) and highest education attainment
  - For each education entry: year range, organization name, education level (e.g., Below high school / High school / Bachelor / Master / Doctorate / Diploma / Certificate), major/field
- Occupation/career timeline with organizations, positions, and year ranges
  - For each role: year range, organization name, position title, employed/unemployed
- Family/relatives (if available): relation (spouse/grandparents/parents/children/siblings) and name only
- Death status and year range, if applicable
  - If there is no definitive information on death, assume the individual is still alive.

SEARCH REQUIREMENTS:
- Confirm all information is about {current_name}{occupation_clause_short}
- Summarize in English; prioritize official government sources, newsletters, encyclopedia pages, organization and personal websites
- Use strategic keyword variations; capture precise year ranges to build a detailed chronological position list
- Wiki pages are not available due to technical reasons, so it is not strange if the searcher returns no URLs for wiki pages.

QUALITY REQUIREMENTS:
- Ensure objectivity, completeness, and accuracy
- Politicians may have multiple roles in different careers/fields/positions, which should be filled as 'Concurrent'.
- Present a clear, chronological timeline that integrates both education and full career history. Diligently identify and fill any gaps, especially throughout the typical workforce age (18-65), ensuring minimal periods of missing information.
- Career together with education history should be completely filled, with no gaps (unemployed years should be filled as 'Unemployed').

OUTPUT FORMAT:
- Include a comprehensive narrative biography (>=600 characters) integrating all details.
- Include the source of the information, credible or not; ensure reproducibility.
```
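The template's placeholder clauses (`{country_clause}`, `{occupation_clause}`, `{year_clause}`) are presumably assembled from codebook metadata per elite. A sketch of that assembly step; the clause wording and `build_clauses` helper below are hypothetical, not the paper's exact strings:

```python
# Hypothetical clause builder: each clause is empty when the codebook
# lacks that field, so the template sentence degrades gracefully.
def build_clauses(country=None, occupation=None, active_year=None):
    return {
        "country_clause": f" from {country}" if country else "",
        "occupation_clause": f" who served as {occupation}" if occupation else "",
        "year_clause": f" active around {active_year}" if active_year else "",
    }

template = ("Find comprehensive public information about {current_name}, "
            "a political or public figure{country_clause}{occupation_clause}{year_clause}.")
query = template.format(current_name="Erik Solheim",
                        **build_clauses(country="Norway", occupation="minister"))
print(query)
```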
##### Research plan template \(EN\)\.
```
# Phase 1: Comprehensive Initial Sweep
1. Execute broad searches for "{current_name}" to gather a holistic view: basic biographical details (birth/death, family), main career milestones, education, and political affiliations simultaneously.
2. Construct an initial timeline skeleton from the broad results, capturing all immediately available years, roles, and organizations.
3. Identify unique identifiers (e.g., specific keywords, middle names, known associations) to disambiguate from homonyms.

# Phase 2: Targeted Expansion & Detail Enrichment
1. Leverage specific entities found in Phase 1 (e.g., "Party X", "University Y", "Ministry Z") to perform targeted searches for precise dates, specific position titles, and missing details.
2. Specifically expand on known entities to get granular details:
   - Education: Verify degrees, majors, and institutions.
   - Party History: Clarify roles and affiliation periods.
   - Career: Flesh out concurrent roles and specific job titles using organization-specific keywords.

# Phase 3: Gap Analysis & Narrative Synthesis
1. Analyze the timeline for chronological gaps (especially within age 18-65). Perform specific queries to fill these gaps (e.g., check for private sector work or unlisted periods).
2. Re-verify any ambiguous data points (e.g., relatives, death date if unclear) and finalize the dataset.
3. Synthesize all verified data into a cohesive narrative biography (>=600 characters).
```
#### A\.5\.3Evaluation prompts
##### Fact\-checking \(related\-content judge\) prompt\.
````
You are a careful fact-checking assistant.
Your task is to evaluate **one biographical fact** about a person using ONLY the
provided related content (snippets aggregated from multiple URLs).

Person identifier: {official_id}
Person name: {official_name}

Biographical fact to check:
```text
{entry}
```

Related content (this is your ONLY evidence source; do not use outside knowledge):
```text
{related_content}
```

Instructions:
- Decide whether the fact is fully supported, partially supported, unclear, or contradicted by the related content.
- Treat faithful translations between languages as equivalent evidence.
- Be progressive: if the evidence is thin or ambiguous, choose 'likely_true'.

Output JSON with exactly these fields:
- entry_text: the original fact text (string)
- verdict: one of 'true', 'likely_true', 'uncertain', or 'false'
- rationale: 1-3 sentences explaining your verdict, citing key phrases from the content (but do NOT invent new facts).

Do NOT include any commentary outside the JSON object.
````
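Because the judge is instructed to emit bare JSON with a closed verdict vocabulary, its output can be validated mechanically before scoring. A minimal sketch against the field contract stated in the prompt above; the `parse_verdict` helper name is ours:

```python
import json

# Verdict vocabulary exactly as listed in the fact-checking prompt.
ALLOWED_VERDICTS = {"true", "likely_true", "uncertain", "false"}

def parse_verdict(raw: str) -> dict:
    """Parse and validate one judge response; raise on contract violations."""
    obj = json.loads(raw)
    missing = {"entry_text", "verdict", "rationale"} - obj.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if obj["verdict"] not in ALLOWED_VERDICTS:
        raise ValueError(f"invalid verdict: {obj['verdict']!r}")
    return obj

raw = ('{"entry_text": "b. 1955", "verdict": "likely_true", '
       '"rationale": "Factbox states born 1955."}')
print(parse_verdict(raw)["verdict"])  # -> likely_true
```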
##### Entrywise evaluation prompt\.
````
## Task: Entrywise Biography Evaluation

You are an expert evaluator of biographical data extraction quality. Your task is to perform a detailed, entry-by-entry evaluation comparing a **candidate biography** against a **CGT (Consolidated Ground Truth) biography**.

---

## Person Information
- **Official ID**: {official_id}
- **Official Name**: {official_name}
- **Experiment Type**: {experiment_type}

---

## Core Evaluation Principle: Content Accuracy Over Structure
This is the most important guiding principle:
- When there are structural differences (e.g., CGT has one merged entry vs candidate has multiple split entries), prioritize judging whether the **total information content** is equivalent.
- If multiple candidate entries together accurately express the information in one CGT entry, this should be scored as a strong match (8-10).
- Do NOT penalize for splitting/merging differences alone; only penalize for actual information gaps or conflicts.

---

## Scoring Rubric (1-5 Scale)

### For CGT Entry Evaluation (How well is each CGT fact captured by the candidate?)
| Score | Category | Description |
|-------|----------|-------------|
| **5** | FULL_MATCH | Perfect or near-perfect match. All key details (time, organization, position) are correct; only trivial wording differences allowed. |
| **4** | PARTIAL_MATCH | Good match with small gaps or simplifications (e.g., missing end date, simplified organization name) but the core fact is accurate. |
| **3** | PARTIAL_MATCH | Partial match. The same event is referenced but with significant gaps or minor errors. |
| **2** | WEAK_MATCH | Very weak/unclear match. Only loosely related content; most details are missing or wrong. |
| **1** | NO_MATCH | No match at all. The CGT fact is completely absent from the candidate biography. |

### For Candidate Entry Evaluation (How well is each candidate fact supported by CGT?)
| Score | Category | Description |
|-------|----------|-------------|
| **5** | FULLY_SUPPORTED | Fully or almost fully supported by CGT. Clear matching CGT entry with at most trivial differences. |
| **4** | PARTIALLY_SUPPORTED | Mostly supported. Core fact is in CGT, with small additions or wording differences. |
| **3** | PARTIALLY_SUPPORTED | Partially supported. Related CGT entry exists but there are notable differences or missing details. |
| **2** | WEAKLY_SUPPORTED | Weakly supported. Only loosely related CGT content; candidate may contain errors. |
| **1** | NOT_SUPPORTED | No support (hallucination). This candidate entry has no real basis in the CGT. |

---

## Difference Codes (for CGT evaluations with score < 5)
When a CGT entry is not perfectly matched, select applicable codes from:
| Code | Meaning |
|------|---------|
| `TIME_YEAR` | Year is incorrect |
| `TIME_MISSING` | Time information is missing from candidate |
| `ORG_WRONG` | Organization name is incorrect (not just abbreviation) |
| `POSITION_WRONG` | Position/title is incorrect |
| `POSITION_INCOMPLETE` | Missing concurrent positions or partial title |
| `EXTRA_INFO` | Candidate has extra information not in CGT (neutral/positive) |

---

## Flexible Alignment Rules

### 1-to-N Matching (CGT merged entry vs Candidate split entries)
- If the candidate splits one CGT entry into multiple lines, list all matching candidate entries separated by "||" in the `matched_candidate_entries` field.
- Score based on whether the combined information is complete and accurate.

### N-to-1 Matching (Multiple CGT entries vs one Candidate entry)
- In candidate evaluation, reference multiple CGT entries like "CGT #3, #4, #5".
- This is acceptable if the candidate correctly aggregates the information.

### Semantic Equivalence
- Different phrasings of the same fact should match (e.g., "Mayor" = "City Mayor").
- Abbreviations vs full names are acceptable (e.g., "EPA" = "Environmental Protection Agency").
- Cross-language translations are equivalent if semantically the same.

---

## Important Notes
1. **Be consistent**: Apply the same standards across all entries.
2. **Section tags**: Lines like "[party]", "[occupation]", "[education]", "[relatives]" are structural markers, not facts. Skip them when counting entries.
3. **Empty lines**: Ignore empty lines when counting and evaluating.
4. Current date is 2025-11-25.

---

## Input Data

### CGT BIOGRAPHY (Ground Truth):
```text
{cgt_biography}
```

### CANDIDATE BIOGRAPHY (Experiment: {experiment_type}):
```text
{experiment_biography}
```

---

## Output Format
Produce a JSON object with exactly these fields:
- `official_id`: string (copy from input: "{official_id}")
- `official_name`: string (copy from input: "{official_name}")
- `experiment_type`: string (copy from input: "{experiment_type}")
- `cgt_entry_count`: integer (number of non-empty, non-tag lines in CGT)
- `candidate_entry_count`: integer (number of non-empty, non-tag lines in candidate)
- `cgt_evaluations`: array of objects, one per CGT entry, each with:
  - `index`: integer (1-based)
  - `cgt_entry_text`: string (the CGT line)
  - `matched_candidate_entries`: string (matching candidate text or "NO_MATCH")
  - `match_score`: integer (1-5)
  - `match_category`: string ("FULL_MATCH", "PARTIAL_MATCH", "WEAK_MATCH", or "NO_MATCH")
  - `difference_codes`: array of strings (codes from the table above, or empty)
  - `notes`: string (brief explanation)
- `candidate_evaluations`: array of objects, one per candidate entry, each with:
  - `index`: integer (1-based)
  - `candidate_entry_text`: string (the candidate line)
  - `matched_cgt_entries`: string (e.g., "CGT #3" or "CGT #1, #2" or "NO_SUPPORT")
  - `support_score`: integer (1-5)
  - `support_category`: string ("FULLY_SUPPORTED", "PARTIALLY_SUPPORTED", "WEAKLY_SUPPORTED", or "NOT_SUPPORTED")
  - `notes`: string (brief explanation)
- `qualitative_summary`: string (2-4 sentences on overall quality)

Do not include markdown fences or any text outside the JSON object.
````
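An entrywise-evaluation report in the schema above can be reduced to two headline numbers: how much of the ground truth the agent recovered (coverage over CGT entries) and how grounded the candidate biography is (support over candidate entries). The sketch below treats scores of 4-5 as a hit; that threshold is our choice for illustration, not the paper's stated metric:

```python
# Aggregate one entrywise-evaluation report (schema per the prompt above)
# into (coverage, support) rates using a score >= 4 hit threshold.
def headline_scores(report: dict) -> tuple[float, float]:
    cgt = report["cgt_evaluations"]
    cand = report["candidate_evaluations"]
    coverage = sum(e["match_score"] >= 4 for e in cgt) / len(cgt)
    support = sum(e["support_score"] >= 4 for e in cand) / len(cand)
    return coverage, support

# Tiny synthetic report: 1 of 3 CGT facts recovered, both candidate
# entries well supported.
report = {
    "cgt_evaluations": [{"match_score": 5}, {"match_score": 3}, {"match_score": 1}],
    "candidate_evaluations": [{"support_score": 5}, {"support_score": 4}],
}
print(headline_scores(report))  # coverage ≈ 0.33, support = 1.0
```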