Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

arXiv cs.LG Papers

Summary

This paper formalizes Streaming Knowledge Compilation for LLM wikis, introducing a materiality signal to proactively pin important documents from a streaming corpus under a token budget. It proves an O(√(T log K)) regret bound and validates the approach in finance and Wikipedia domains, showing that regret analysis is a reliable evaluation metric.

arXiv:2606.09877v1 Announce Type: new Abstract: LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus -- an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $\phi_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$, acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an $O(\sqrt{T\log K})$ regret bound where $\varepsilon=\mathbb{E}[|\phi_t-\hat\phi_t|]$ is the only domain-specific quantity. We instantiate in two domains: finance, where $\phi_t$ is abnormal stock volatility predicted by frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles); and Wikipedia, where $\phi_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity -- showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis -- not absolute QA scores -- is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training -- richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) -- and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.
Original Article
View Cached Full Text

Cached at: 06/10/26, 06:15 AM

# Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis
Source: [https://arxiv.org/html/2606.09877](https://arxiv.org/html/2606.09877)
Juan M\. Huerta Zinnia Tech Solutions 600 Steamboat Road Greenwich, CT 06830, USA juan\.huerta@zinnia\.com

###### Abstract

LLM wiki systems compile knowledge into pre\-filled KV caches for efficient inference, but assume a static corpus—an assumption that fails whenever the underlying information landscape evolves continuously\. We formalize*Streaming Knowledge Compilation*: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight\. The enabling insight is a*materiality signal*ϕt​\(k,n\)∈\[0,1\]\\phi\_\{t\}\(k,n\)\\in\[0,1\]that scores the informational importance of documentnnfor entitykkat timettand acts as a surrogate for query relevance, permitting proactive pinning before queries arrive; we prove anO​\(T​log⁡K\)O\(\\sqrt\{T\\log K\}\)regret bound where the prediction\-error termε=𝔼​\[\|ϕt−ϕ^t\|\]\\varepsilon=\\mathbb\{E\}\[\|\\phi\_\{t\}\-\\hat\{\\phi\}\_\{t\}\|\]is the only domain\-specific quantity\. We instantiate the framework in two domains:*finance*, whereϕt\\phi\_\{t\}is abnormal stock volatility predicted by a frozen Llama 3\.1 8B classification head \(AUROC=0\.728=0\.728on 76K articles, strict temporal split;1\.49×1\.49\\timeshigher realized forward volatility for predicted\-material articles\); and*Wikipedia*, whereϕt\\phi\_\{t\}is the Abnormal Edit Ratio \(AER\), a cross\-sectionally normalized edit velocity—demonstrating that the same algorithm generalizes to non\-financial streaming corpora under a different signal\. End\-to\-end QA evaluation on 173 matched pairs \(finance\) and 119 matched pairs \(Wikipedia\) reveals a pervasive LLM\-as\-judge confound on post\-training facts, establishing that regret analysis—not absolute QA scores—is the reliable evaluation metric for compiled knowledge systems\. Finance cumulative regret converges to−20\.0\-20\.0\(−0\.12\-0\.12/step\); Wikipedia regret is\+16\.0\+16\.0\(\+0\.13\+0\.13/step\), with positive sign confirming that Wikipedia edit content is genuinely post\-training—richer context consistently improves scores \(No Wiki 3\.80 vs\. Oracle 4\.74\)—eliminating the confound present in the finance evaluation\. The predictive CEGAR formalization andO​\(T​log⁡K\)O\(\\sqrt\{T\\log K\}\)guarantee apply to any domain where knowledge gaps can be predicted from streaming signals\.

## 1Introduction

The emergence of LLM wiki systems\(Huerta,[2026](https://arxiv.org/html/2606.09877#bib.bib1); Chanet al\.,[2025](https://arxiv.org/html/2606.09877#bib.bib5)\)has established a compelling alternative to retrieval\-augmented generation \(RAG\)\(Lewiset al\.,[2020](https://arxiv.org/html/2606.09877#bib.bib6)\): rather than retrieving documents at query time, one*compiles*a corpus into a structured wiki and pre\-fills the LLM’s KV cache, enabling fast, grounded inference without retrieval latency\. The WiCER algorithm\(Huerta,[2026](https://arxiv.org/html/2606.09877#bib.bib1)\)formalizes this compilation through an iterative*Compile–Evaluate–Refine*loop inspired by counterexample\-guided abstraction refinement \(CEGAR\)\(Clarkeet al\.,[2000](https://arxiv.org/html/2606.09877#bib.bib7)\): compile a wiki, evaluate it against probe questions, diagnose missing facts,*pin*them, and recompile\.

This framework implicitly assumes a*static*underlying corpus\. In practice, this assumption fails across a wide range of high\-value domains\. Consider a financial analyst wiki compiled on a Friday evening for a portfolio of 50 stocks: by Monday morning, a DOJ antitrust investigation may have been announced, an earnings surprise reported, or a CEO departed\. Consider a medical knowledge base for clinical decision support: a drug recall, trial result, or updated dosing guideline may arrive overnight\. Consider a Wikipedia\-backed QA system: the pages for major AI companies or geopolitical events may receive hundreds of edits in a single day around a breaking development\. In each case, the wiki is stale before it is ever queried, and the fundamental challenge is not merely*updating*it—it is deciding*which*of the hundreds of daily documents warrant incorporation within a fixed token budget\.

Our key insight is that the CEGAR “counterexample” need not be discovered reactively through QA failures\. A*materiality signal*ϕt​\(k,n\)∈\[0,1\]\\phi\_\{t\}\(k,n\)\\in\[0,1\]—scoring the informational importance of documentnnfor entitykkat timett—can serve as a proxy for query relevance, enabling proactive pinning before any query exposes a knowledge gap\. What this signal is concretely is domain\-specific: in finance, it is abnormal stock volatility; in Wikipedia, it is abnormal edit velocity\. But the algorithm, theory, and guarantees are entirely agnostic to this choice\. This transforms the WiCER refinement loop from reactive diagnosis to*proactive, prediction\-driven*knowledge maintenance\.

We introduce*Online WiCER*, an algorithm that operates on a continuous news stream:

1. 1\.Mine: Extract candidate facts from incoming news articles\.
2. 2\.Score: Assess each fact’s*marginal*value using a state\-aware scorer: a regression head on the same frozen backbone that conditions on the current pin set, estimating the expected regret reduction of each candidate given what is already pinned\.
3. 3\.Pin: Greedily select facts that maximise marginal regret reduction under the token budget, evicting stale pins via a decay\-weighted priority queue\.
4. 4\.Compile: Perform incremental wiki updates daily and full WiCER recompilation everyTrT\_\{r\}steps\.

##### Contributions\.

1. 1\.Streaming Knowledge Compilationformalized as a budget\-constrained online optimization problem: maintain a compiled wiki against a streaming corpus, minimize regret against an oracle with perfect foresight, and produce proactive pinning decisions from a domain\-specific materiality signalϕt\\phi\_\{t\}\(§[3](https://arxiv.org/html/2606.09877#S3)\)\.
2. 2\.TheOnline WiCER algorithm: proactive materiality\-scored pinning, decay\-weighted eviction, and periodic WiCER recompilation, with formal convergence guarantees \(§[4](https://arxiv.org/html/2606.09877#S4)\)\.
3. 3\.Astate\-aware marginal regret scorer: a regression head on a frozen backbone that conditions on the current pin set and enables a greedy pin selection rule with a\(1−1/e\)\(1\-1/e\)submodular approximation guarantee \(§[4\.3](https://arxiv.org/html/2606.09877#S4.SS3)\)\.
4. 4\.Aregret decomposition theoremprovingO​\(T​log⁡K\)O\(\\sqrt\{T\\log K\}\)cumulative regret where the only domain\-specific term is the prediction errorε=𝔼​\[\|ϕt−ϕ^t\|\]\\varepsilon=\\mathbb\{E\}\[\|\\phi\_\{t\}\-\\hat\{\\phi\}\_\{t\}\|\], establishing that the framework applies to any bounded materiality signal \(§[5](https://arxiv.org/html/2606.09877#S5)\)\.
5. 5\.A formalization ofpredictive CEGAR, extending the reactive CEGAR paradigm to proactive, prediction\-driven refinement, applicable to any domain where knowledge gaps can be predicted from streaming signals \(§[4\.7](https://arxiv.org/html/2606.09877#S4.SS7)\)\.
6. 6\.Afinance instantiation: abnormal stock volatility asϕt\\phi\_\{t\}, scored by a frozen Llama 3\.1 8B classification head \(AUROC = 0\.728, strict temporal split,1\.49×1\.49\\timesrealized volatility ratio for predicted\-material articles\); cumulative regret over 173 matched pairs converges to−20\.0\-20\.0\(mean−0\.12\-0\.12/step\) \(§[6\.1](https://arxiv.org/html/2606.09877#S6.SS1)\)\.
7. 7\.AWikipedia instantiation: Abnormal Edit Ratio \(AER\) asϕt\\phi\_\{t\}, demonstrating that the identical algorithm achieves sub\-linear regret on a non\-financial, publicly available streaming corpus \(§[6\.2](https://arxiv.org/html/2606.09877#S6.SS2)\)\.
8. 8\.Amethodological finding: LLM\-as\-judge evaluation is confounded on post\-training facts; regret analysis on matched pairs is the reliable metric for compiled knowledge systems where the backbone’s parametric memory is a confound \(§[7\.1\.8](https://arxiv.org/html/2606.09877#S7.SS1.SSS8)\)\.

## 2Related Work

##### Knowledge compilation and LLM wikis\.

RAG\(Lewiset al\.,[2020](https://arxiv.org/html/2606.09877#bib.bib6)\)retrieves relevant documents at query time, incurring latency and retrieval noise\. Cache\-augmented generation \(CAG\)\(Chanet al\.,[2025](https://arxiv.org/html/2606.09877#bib.bib5)\)and the LLM wiki pattern pre\-fill the KV cache with compiled knowledge, trading compilation cost for inference speed\. RAPTOR\(Sarthiet al\.,[2024](https://arxiv.org/html/2606.09877#bib.bib3)\)builds hierarchical summaries; GraphRAG\(Edgeet al\.,[2024](https://arxiv.org/html/2606.09877#bib.bib4)\)constructs knowledge graphs\. WiCER\(Huerta,[2026](https://arxiv.org/html/2606.09877#bib.bib1)\)introduces iterative refinement via CEGAR\-inspired pinning\. All assume a static corpus\. We extend WiCER to the streaming setting\.

##### Financial NLP and LLM\-based prediction\.

FinBERT\(Araci,[2019](https://arxiv.org/html/2606.09877#bib.bib10)\)adapts BERT\(Devlinet al\.,[2019](https://arxiv.org/html/2606.09877#bib.bib69)\)for financial sentiment; BloombergGPT\(Wuet al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib11)\)trains a domain\-specific LLM on financial data\. Recent work explores LLMs for stock prediction\(Lopez\-Lira and Tang,[2023](https://arxiv.org/html/2606.09877#bib.bib13); Xieet al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib12)\)and financial instruction tuning\(Zhanget al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib14); Yanget al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib15)\)\. LLM embeddings from the Llama family predict cross\-sectional returns, outperforming word\-embedding baselines\(Chenet al\.,[2022](https://arxiv.org/html/2606.09877#bib.bib49)\); decoder LLMs with lightweight classification heads prove superior to encoder models for large stock universes\(Guo and Hauptmann,[2024](https://arxiv.org/html/2606.09877#bib.bib50)\)\. Temporal validity is a growing concern:Heet al\.\([2025](https://arxiv.org/html/2606.09877#bib.bib51)\)demonstrate that standard pretrained LLMs encode future information, motivating our strict temporal train/test split \(§[6](https://arxiv.org/html/2606.09877#S6)\)\.Liet al\.\([2024](https://arxiv.org/html/2606.09877#bib.bib57)\)score news along five axes—including event significance and price\-impact duration—via an LLM\-based denoised encoder;Wanget al\.\([2024b](https://arxiv.org/html/2606.09877#bib.bib58)\)iteratively filter and align news with time\-series fluctuations via LLM reflection, a loop structurally analogous to our streaming pinning cycle\. Conversely,Tanet al\.\([2024](https://arxiv.org/html/2606.09877#bib.bib60)\)show that naive LLM substitution for time\-series forecasting does not improve performance—underscoring our design choice of using the LLM only for text materiality scoring, not for price dynamics\. We leverage NLP signal not for trading but for*knowledge curation*—predicting which news items will cause abnormal volatility and therefore warrant wiki inclusion\. Rather than using a separate encoder model for classification, we train a lightweight classification head on top of the same frozen LLM used for wiki compilation, yielding a unified single\-model architecture\.

##### Formal quantitative models of volatility\.

Volatility has a foundational role in quantitative finance that motivates using it as a materiality signal\.Engle \([1982](https://arxiv.org/html/2606.09877#bib.bib16)\)established that return variance is time\-varying and autoforecastable \(ARCH models; Nobel Prize in Economics 2003\);Bollerslev \([1986](https://arxiv.org/html/2606.09877#bib.bib17)\)generalized this to GARCH, now the standard engine for derivatives pricing and market\-risk management\.Black and Scholes \([1973](https://arxiv.org/html/2606.09877#bib.bib18)\)showed that volatility is the sole unobservable input in options pricing—making its prediction directly monetizable—and stochastic volatility models\(Heston,[1993](https://arxiv.org/html/2606.09877#bib.bib19)\)extend this to a latent variance process, further amplifying demand for accurate forecasts\. The realized volatility framework\(Andersen and Bollerslev,[1998](https://arxiv.org/html/2606.09877#bib.bib20); Barndorff\-Nielsen and Shephard,[2002](https://arxiv.org/html/2606.09877#bib.bib21)\)grounds volatility estimation in model\-free measures computed from high\-frequency OHLCV price data; this is precisely how we compute 5\-day forward realized volatility from Yahoo Finance data\.Corsi \([2009](https://arxiv.org/html/2606.09877#bib.bib22)\)proposes the HAR model, capturing the multi\-scale \(daily, weekly, monthly\) persistence of realized volatility; ML models substantially outperform HAR for this task\(Christensenet al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib24)\), motivating our classification head over autoregressive baselines\. Our abnormal volatility ratio \(AVR, Eq\.[1](https://arxiv.org/html/2606.09877#S3.E1)\) inherits this realized\-volatility foundation and applies cross\-sectional normalization to isolate firm\-specific information events from market\-wide moves, connecting the high\-frequency estimation literature\(Boudoukhet al\.,[2019](https://arxiv.org/html/2606.09877#bib.bib23)\)to our knowledge\-curation application\.

##### News\-driven volatility prediction\.

A growing literature establishes that news text predicts stock*volatility*more reliably than price direction\.Atkinset al\.\([2018](https://arxiv.org/html/2606.09877#bib.bib26)\)directly demonstrate 56% accuracy for volatility prediction vs\. 49% \(chance\) for price direction from financial news\.Glasserman and Mamaysky \([2019](https://arxiv.org/html/2606.09877#bib.bib27)\)show that*unusual*news content—measured by information\-theoretic divergence—forecasts elevated firm\-specific and aggregate volatility months ahead\.Manela and Moreira \([2017](https://arxiv.org/html/2606.09877#bib.bib28)\)construct a text\-based volatility index \(NVIX\) from Wall Street Journal front pages spanning 1890–2009\.Bodilsen and Lunde \([2025](https://arxiv.org/html/2606.09877#bib.bib30)\)augment HAR models with news sentiment, achieving large improvements at multi\-day horizons\.Xinget al\.\([2019](https://arxiv.org/html/2606.09877#bib.bib34)\)propose a sentiment\-aware volatility model using variational Bayes\. At the macro level,Bakeret al\.\([2016](https://arxiv.org/html/2606.09877#bib.bib29)\)build news\-based policy uncertainty indices that predict market volatility, andBybeeet al\.\([2024](https://arxiv.org/html/2606.09877#bib.bib32)\)analyze 800K WSJ articles via topic modeling to forecast business cycles and market dynamics\.Boudoukhet al\.\([2019](https://arxiv.org/html/2606.09877#bib.bib23)\)decompose firm\-specific news into fundamental vs\. non\-fundamental categories, finding that fundamental news explains nearly half of overnight idiosyncratic realized volatility—directly motivating the materiality\-filtering design we adopt\. On the forecasting side, ML models significantly outperform the HAR family for realized volatility\(Christensenet al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib24)\), and cross\-sectional pooling of stock data further improves neural forecasts\(Zhanget al\.,[2024](https://arxiv.org/html/2606.09877#bib.bib25)\); our cross\-sectional abnormal volatility definition \(Eq\.[1](https://arxiv.org/html/2606.09877#S3.E1)\) exploits the same cross\-sectional structure\. Our work nonetheless differs from volatility*forecasting*: we use abnormal volatility as a*pinning signal*for knowledge curation, not as a trading target\.

##### Volatility as an information signal\.

The SEC defines information as “material” if a reasonable investor would consider it important\. Empirically, abnormal stock volatility is a robust proxy for information relevance\(Khanet al\.,[2016](https://arxiv.org/html/2606.09877#bib.bib39); Grewalet al\.,[2021](https://arxiv.org/html/2606.09877#bib.bib40)\): events that cause a stock to move significantly more than the market indicate the arrival of firm\-specific information\. Text\-derived risk indices also move markets at the macro level:Hassanet al\.\([2019](https://arxiv.org/html/2606.09877#bib.bib35)\)measure firm\-level political risk from earnings call transcripts, showing idiosyncratic text signals cause measurable investment retrenchment;Caldara and Iacoviello \([2022](https://arxiv.org/html/2606.09877#bib.bib36)\)causally link a newspaper\-derived geopolitical risk index to VIX spikes\. Most closely related to our volatility scorer,Zhaoet al\.\([2025](https://arxiv.org/html/2606.09877#bib.bib37)\)train a classification head on frozen LLM representations of 10\-K filings to predict firm\-specific idiosyncratic volatility, whileCaoet al\.\([2024](https://arxiv.org/html/2606.09877#bib.bib38)\)jointly forecast volatility and Value\-at\-Risk from multimodal earnings call signals \(text, audio, and time series\)\. We extend these ideas to the*daily news stream*: our scorer operates in real time at news arrival, targeting abnormal volatility relative to the cross\-sectional market average, and feeds this signal into a streaming knowledge\-curation system rather than a trading model\. We adopt an abnormal volatility definition: a news item is a high\-volatility event for entitykkif it is associated with*abnormal realized volatility*relative to the cross\-sectional market average over the subsequent 5 trading days\. This direction\-agnostic definition captures both positive and negative surprises while filtering out broad market moves\.

##### Online learning and caching\.

The experts problem\(Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2606.09877#bib.bib42); Hazan,[2016](https://arxiv.org/html/2606.09877#bib.bib43)\)provides regret bounds for sequential decision\-making\. Competitive analysis of paging algorithms\(Fiatet al\.,[1991](https://arxiv.org/html/2606.09877#bib.bib46)\)and weighted paging with predictions\(Bansalet al\.,[2012](https://arxiv.org/html/2606.09877#bib.bib47)\)inform our pin eviction strategy\. Online WiCER’s pin management reduces to a weighted paging problem with learned predictions\. In the LLM memory literature,Wanget al\.\([2024a](https://arxiv.org/html/2606.09877#bib.bib59)\)address lifelong knowledge editing via a dual\-memory scheme \(main parametric memory \+ side memory for new edits, with a learned router\) that resolves the reliability–generalization–locality tension; our wiki plays the role of the side memory, with volatility scoring acting as the router that decides what enters it\.

##### CEGAR extensions\.

Lazy abstraction\(Henzingeret al\.,[2002](https://arxiv.org/html/2606.09877#bib.bib8); McMillan,[2006](https://arxiv.org/html/2606.09877#bib.bib9)\)refines abstractions incrementally rather than globally\. Our predictive CEGAR extends this: rather than waiting for a counterexample \(failed QA probe\), we*predict*likely counterexamples from the news stream and preemptively refine\.

##### Streaming QA and temporal knowledge\.

StreamingQA\(Liskaet al\.,[2022](https://arxiv.org/html/2606.09877#bib.bib64)\)and RealTimeQA\(Kasaiet al\.,[2023](https://arxiv.org/html/2606.09877#bib.bib63)\)benchmark models on temporally evolving knowledge\.Choiet al\.\([2025](https://arxiv.org/html/2606.09877#bib.bib61)\)introduce FinAgentBench, a benchmark specifically for agentic retrieval in financial QA, which provides an evaluation framework directly applicable to assessing our wiki’s downstream utility\. These works focus on*evaluating*temporal knowledge; we focus on*maintaining*it proactively within a compiled wiki\.

## 3Problem Formulation

We consider a discrete\-time settingt=1,2,…,Tt=1,2,\\ldots,Twith a universe ofKKentities \(e\.g\., companies, Wikipedia articles, or any other knowledge\-bearing units\)\. At each time step:

1. 1\.A batch of news articles𝒩t=\{n1,…,n\|𝒩t\|\}\\mathcal\{N\}\_\{t\}=\\\{n\_\{1\},\\ldots,n\_\{\|\\mathcal\{N\}\_\{t\}\|\}\\\}arrives\.
2. 2\.A set of queriesQt=\{q1,…,q\|Qt\|\}Q\_\{t\}=\\\{q\_\{1\},\\ldots,q\_\{\|Q\_\{t\}\|\}\\\}is posed against the wiki\.
3. 3\.The wiki𝒲t\\mathcal\{W\}\_\{t\}serves answers using its pre\-filled KV cache\.

##### Wiki budget\.

The wiki𝒲t\\mathcal\{W\}\_\{t\}has a total token budgetBB\. Of this, a fractionBpinB\_\{\\text\{pin\}\}is reserved for dynamically pinned facts; the remainderB−BpinB\-B\_\{\\text\{pin\}\}holds the base compiled wiki\.

##### Materiality signal\.

The framework requires a*materiality signal*ϕt​\(k,n\)∈\[0,1\]\\phi\_\{t\}\(k,n\)\\in\[0,1\]that scores the informational importance of news articlennfor entitykkat timett\. The theory holds for any bounded signal; the algorithm instantiatesϕt\\phi\_\{t\}as a trained predictor of this signal\. Pinning decisions at stepttuse the predictedϕ^t\\hat\{\\phi\}\_\{t\}rather than the trueϕt\\phi\_\{t\}, and the regret bound absorbs the prediction errorε=𝔼​\[\|ϕt−ϕ^t\|\]\\varepsilon=\\mathbb\{E\}\[\|\\phi\_\{t\}\-\\hat\{\\phi\}\_\{t\}\|\]\(Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\)\.

*Instantiation \(this work\)\.*We instantiateϕt\\phi\_\{t\}as the*abnormal volatility indicator*: for entitykkand articlennarriving at timett, define forward realized volatilityRVk,t=std​\(rk,t\+1,…,rk,t\+5\)\\text\{RV\}\_\{k,t\}=\\text\{std\}\(r\_\{k,t\+1\},\\ldots,r\_\{k,t\+5\}\)and cross\-sectional averageRV¯t=1K​∑k=1KRVk,t\\overline\{\\text\{RV\}\}\_\{t\}=\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}\\text\{RV\}\_\{k,t\}\. The abnormal volatility ratio and indicator are:

AVR​\(k,t\)=RVk,tRV¯t,𝒱​\(k,n\)=𝟏​\[AVR​\(k,t\)\>2\]\\text\{AVR\}\(k,t\)=\\frac\{\\text\{RV\}\_\{k,t\}\}\{\\overline\{\\text\{RV\}\}\_\{t\}\},\\qquad\\mathcal\{V\}\(k,n\)=\\mathbf\{1\}\\\!\\left\[\\text\{AVR\}\(k,t\)\>2\\right\]\(1\)This definition is direction\-agnostic \(capturing both positive and negative surprises\) and market\-relative \(filtering out broad moves that affect all stocks equally\)\.

*Other instantiations\.*The same framework accommodates any materiality signal, including earnings surprise, credit\-spread widening, sentiment shift, or regulatory\-attention scores\. Empirical validation of alternativeϕt\\phi\_\{t\}instantiations is left to future work; the regret bound and algorithmic structure are signal\-agnostic\.

##### Pins\.

A pin is a tuple\(f,k,t,s\)\(f,k,t,s\): a factffextracted from news, associated with entitykk, timestamped attt, with materiality scores=ϕ^t​\(k,n\)∈\[0,1\]s=\\hat\{\\phi\}\_\{t\}\(k,n\)\\in\[0,1\]\. Each pin consumes\|f\|\|f\|tokens fromBpinB\_\{\\text\{pin\}\}\.

##### Query set decomposition\.

At each step the query setQtQ\_\{t\}has two components with different origins and evaluation roles:

Qt=Qtbg⏟standing queries∪Qtvol⏟event\-driven queriesQ\_\{t\}\\;=\\;\\underbrace\{Q\_\{t\}^\{\\textup\{bg\}\}\}\_\{\\text\{standing queries\}\}\\;\\cup\\;\\underbrace\{Q\_\{t\}^\{\\textup\{vol\}\}\}\_\{\\text\{event\-driven queries\}\}\(2\)QtbgQ\_\{t\}^\{\\textup\{bg\}\}contains*standing*analyst questions—slow\-changing portfolio\-level queries \(e\.g\., credit\-risk profiles, sector comparisons\) that are relevant regardless of today’s news\. These are set at initialisation and refreshed at coarse intervals; their gold answers come from the accumulated corpus up to timett\.QtvolQ\_\{t\}^\{\\textup\{vol\}\}contains*event\-driven*questions generated at timettfrom the newly pinned factsΔ​𝒫t\\Delta\\mathcal\{P\}\_\{t\}via theQueryGensubroutine \(§[4\.6](https://arxiv.org/html/2606.09877#S4.SS6)\): for each pinned fact\(f,k\)\(f,k\), the LLM generates one specific question whose gold answer is the factffitself\. BecauseQtvolQ\_\{t\}^\{\\textup\{vol\}\}is produced after pinning completes, pinning decisions at stepttstill cannot observeQtvolQ\_\{t\}^\{\\textup\{vol\}\}—the causal constraint of the online setting is preserved\.

##### Quality metric\.

Let𝒬​\(𝒲t,q\)∈\[0,1\]\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\},q\)\\in\[0,1\]denote the quality of wiki𝒲t\\mathcal\{W\}\_\{t\}on queryqq\. The specific instantiation of𝒬\\mathcal\{Q\}is domain\-dependent; the theory holds for any bounded quality function\. Define the*offline oracle*wiki𝒲t∗\\mathcal\{W\}\_\{t\}^\{\*\}as the wiki compiled with perfect foresight of which documents are high\-materiality events \(i\.e\.,ϕt​\(k,n\)=1\\phi\_\{t\}\(k,n\)=1\)\. The cumulative regret decomposes naturally over the two query populations:

Regret​\(T\)=∑t=1T∑q∈Qtbg\[𝒬​\(𝒲t∗,q\)−𝒬​\(𝒲t,q\)\]⏟standing\-query regret\+∑t=1T∑q∈Qtvol\[𝒬​\(𝒲t∗,q\)−𝒬​\(𝒲t,q\)\]⏟event\-query regret\\text\{Regret\}\(T\)=\\underbrace\{\\sum\_\{t=1\}^\{T\}\\sum\_\{q\\in Q\_\{t\}^\{\\textup\{bg\}\}\}\\bigl\[\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\}^\{\*\},q\)\-\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\},q\)\\bigr\]\}\_\{\\text\{standing\-query regret\}\}\+\\underbrace\{\\sum\_\{t=1\}^\{T\}\\sum\_\{q\\in Q\_\{t\}^\{\\textup\{vol\}\}\}\\bigl\[\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\}^\{\*\},q\)\-\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\},q\)\\bigr\]\}\_\{\\text\{event\-query regret\}\}\(3\)Standing\-query regret measures how well the base wiki answers persistent questions; event\-query regret measures how well the pin layer answers questions about the events it just pinned\.

##### The role of queries\.

Queries play two distinct roles in Online WiCER that are important to separate\. First, they are the*measurement instrument*in the theoretical analysis:𝒬​\(𝒲t,q\)\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\},q\)defines what “wiki quality” means, and the regret in Equation \([3](https://arxiv.org/html/2606.09877#S3.E3)\) is the sum of per\-query quality gaps\. Second, they are the*downstream consumer*of the compiled wiki at inference time \(Algorithm[1](https://arxiv.org/html/2606.09877#alg1), final step\)\.

Crucially, queries do*not*drive the pinning decisions\. At steptt, the news batch𝒩t\\mathcal\{N\}\_\{t\}arrives and pinning decisions must be made*before*QtQ\_\{t\}is observed—this is the defining constraint of the online setting\. The algorithm therefore cannot condition on queries when deciding what to pin\. Instead, it uses a predicted*materiality signal*ϕ^t\\hat\{\\phi\}\_\{t\}as a proxy for query relevance, under the assumption that high\-materiality facts are the facts analysts will subsequently query about\. This substitution is the central modelling choice of Online WiCER: it converts an intractable query\-conditioned pinning problem into a tractable prediction problem, at the cost of a prediction error termε=𝔼​\[\|ϕt−ϕ^t\|\]\\varepsilon=\\mathbb\{E\}\[\|\\phi\_\{t\}\-\\hat\{\\phi\}\_\{t\}\|\]in the regret bound \(Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\)\. In this workϕ^t\\hat\{\\phi\}\_\{t\}is instantiated as an abnormal volatility predictor; the framework is agnostic to this choice\.

We partially close this gap by makingQtvolQ\_\{t\}^\{\\textup\{vol\}\}*endogenous*: the same high\-volatility facts that are pinned generate specific questions about those events \(§[4\.6](https://arxiv.org/html/2606.09877#S4.SS6)\)\. The full decompositionQt=Qtbg∪QtvolQ\_\{t\}=Q\_\{t\}^\{\\textup\{bg\}\}\\cup Q\_\{t\}^\{\\textup\{vol\}\}is defined in Equation \([2](https://arxiv.org/html/2606.09877#S3.E2)\)\.

##### Objective\.

Design an online algorithm that minimizesRegret​\(T\)\\text\{Regret\}\(T\)subject to the token budget constraint\|𝒲t\|≤B\|\\mathcal\{W\}\_\{t\}\|\\leq Bfor alltt, using only information available at the time each pinning decision is made\.

## 4Algorithm: Online WiCER

### 4\.1Main Loop

Algorithm[1](https://arxiv.org/html/2606.09877#alg1)presents the Online WiCER procedure\. The algorithm maintains a wiki𝒲t\\mathcal\{W\}\_\{t\}and a priority queue of active pins𝒫t\\mathcal\{P\}\_\{t\}\. At each time step, it processes incoming news, scores candidate facts for volatility impact, selects pins under the budget, and incrementally updates the wiki\. Periodically, a full WiCER recompilation integrates pinned facts into the base wiki\.

Algorithm 1Online WiCER0:Entity set

EE, initial corpus

D0D\_\{0\}, budget

BB, threshold

τ\\tau, decay

λ\\lambda, recompile period

TrT\_\{r\}
1:

W0←BatchWiCER​\(D0,B\)W\_\{0\}\\leftarrow\\textsc\{BatchWiCER\}\(D\_\{0\},B\)
2:

𝒫0←∅\\mathcal\{P\}\_\{0\}\\leftarrow\\emptyset\{Active pin set\}

3:

queue←PriorityQueue​\(\)\\text\{queue\}\\leftarrow\\text\{PriorityQueue\}\(\)
4:for

t=1,2,…,Tt=1,2,\\ldots,Tdo

5:Receive news batch

𝒩t\\mathcal\{N\}\_\{t\}
6:for

n∈𝒩tn\\in\\mathcal\{N\}\_\{t\}do

7:

Cn←ExtractFacts​\(n\)C\_\{n\}\\leftarrow\\textsc\{ExtractFacts\}\(n\)\{Candidate facts111In our implementation,ExtractFactscalls the LLM once per article to extract the key fact as a single sentence; raw headlines may be substituted for speed\.\}

8:for

c∈Cnc\\in C\_\{n\}do

9:

sc←Δ^​Rθ​\(c∣𝒫t−1\)s\_\{c\}\\leftarrow\\hat\{\\Delta\}R\_\{\\theta\}\(c\\mid\\mathcal\{P\}\_\{t\-1\}\)\{§[4\.2](https://arxiv.org/html/2606.09877#S4.SS2),[4\.3](https://arxiv.org/html/2606.09877#S4.SS3)\}

10:if

sc≥τs\_\{c\}\\geq\\tauthen

11:

queue\.push​\(c,sc,t\)\\text\{queue\.push\}\(c,s\_\{c\},t\)
12:endif

13:endfor

14:endfor

15:

𝒫t←PinSelect​\(𝒫t−1,queue,Bpin,λ,t\)\\mathcal\{P\}\_\{t\}\\leftarrow\\textsc\{PinSelect\}\(\\mathcal\{P\}\_\{t\-1\},\\text\{queue\},B\_\{\\text\{pin\}\},\\lambda,t\)\{§[4\.4](https://arxiv.org/html/2606.09877#S4.SS4)\}

16:

Δ​𝒫t←𝒫t∖𝒫t−1\\Delta\\mathcal\{P\}\_\{t\}\\leftarrow\\mathcal\{P\}\_\{t\}\\setminus\\mathcal\{P\}\_\{t\-1\}\{Newly pinned facts this step\}

17:

Wt←IncrementalCompile​\(Wt−1,Δ​𝒫t\)W\_\{t\}\\leftarrow\\textsc\{IncrementalCompile\}\(W\_\{t\-1\},\\Delta\\mathcal\{P\}\_\{t\}\)
18:if

tmodTr=0t\\bmod T\_\{r\}=0then

19:

Wt←BatchWiCER​\(D0∪𝒟recent,B,pins=𝒫t\)W\_\{t\}\\leftarrow\\textsc\{BatchWiCER\}\(D\_\{0\}\\cup\\mathcal\{D\}\_\{\\text\{recent\}\},B,\\text\{pins\}=\\mathcal\{P\}\_\{t\}\)
20:endif

21:

Qtvol←QueryGen​\(Δ​𝒫t\)Q\_\{t\}^\{\\textup\{vol\}\}\\leftarrow\\textsc\{QueryGen\}\(\\Delta\\mathcal\{P\}\_\{t\}\)\{Event\-driven queries \(§[4\.6](https://arxiv.org/html/2606.09877#S4.SS6)\)\}

22:

Qt←Qtbg∪QtvolQ\_\{t\}\\leftarrow Q\_\{t\}^\{\\textup\{bg\}\}\\cup Q\_\{t\}^\{\\textup\{vol\}\}\{Standing queries merged with event\-driven\}

23:Serve each

q∈Qtq\\in Q\_\{t\}against

WtW\_\{t\}
24:endfor

### 4\.2Materiality Scoring

The algorithm requires a predicted materiality scoreϕ^t​\(k,n\)∈\[0,1\]\\hat\{\\phi\}\_\{t\}\(k,n\)\\in\[0,1\]for each incoming document\. The framework is agnostic to how this score is produced; the only requirement is that it is bounded and observable at pinning time\. We describe the architecture used in the finance instantiation \(§[6\.1](https://arxiv.org/html/2606.09877#S6.SS1)\); the Wikipedia instantiation \(§[6\.2](https://arxiv.org/html/2606.09877#S6.SS2)\) uses the AER signal in place of the scoring network\.

##### Finance instantiation: unified frozen\-backbone architecture\.

We employ a*unified*scoring architecture built on a single frozen Llama 3\.1 8B model, with two inference paths that are fused into a final materiality score\.

##### LLM zero\-shot scorer\.

Given a news articlennand entitykk, prompt the LLM generatively to produce a probability estimate that the article will cause abnormal 5\-day realized volatility \(exceeding2×2\\timesthe cross\-sectional market average\)\. The exact prompt is reproduced verbatim in Appendix[B](https://arxiv.org/html/2606.09877#A2)for full reproducibility\. This yieldsszs​\(k,n\)∈\[0,1\]s\_\{\\text\{zs\}\}\(k,n\)\\in\[0,1\]\. The zero\-shot path requires no training data and generalizes to novel event types\.

##### Classification head \(linear probe\)\.

We attach a lightweight classification head \(linear projection \+ sigmoid\) to the last hidden state of the frozen Llama 3\.1 8B backbone\. The head is trained on historical \(news, volatility label\) pairs with binary cross\-entropy loss, producing probability estimatesshead​\(k,n\)∈\[0,1\]s\_\{\\text\{head\}\}\(k,n\)\\in\[0,1\]\. Only the head parameters are trained; the backbone weights are never updated—not even a single gradient step propagates into the LLM\. This is a*linear probe*\(Alain and Bengio,[2017](https://arxiv.org/html/2606.09877#bib.bib48)\): the LLM is used purely as a feature extractor, and the classification head learns a single hyperplane in the backbone’s representation space\. The fact that this linear probe achieves meaningful discrimination \(AUROC = 0\.728, §[6](https://arxiv.org/html/2606.09877#S6)\) implies that abnormal volatility signal is*already linearly encoded*in the pretrained representations, without any task\-specific adaptation of the backbone parameters\.

##### Hybrid fusion\.

The final volatility score combines both inference paths from the same model:

s​\(k,n\)=α⋅szs​\(k,n\)\+\(1−α\)⋅shead​\(k,n\)s\(k,n\)=\\alpha\\cdot s\_\{\\text\{zs\}\}\(k,n\)\+\(1\-\\alpha\)\\cdot s\_\{\\text\{head\}\}\(k,n\)\(4\)whereα∈\[0,1\]\\alpha\\in\[0,1\]is a mixing parameter tuned on validation data\. The zero\-shot path provides broad generalization to novel events, while the classification head offers superior calibration on common patterns\. Crucially, both paths use the same frozen backbone weights, so the marginal cost of the classification head is a single matrix multiply\. This stateless scorer evaluates each candidate*independently*; §[4\.3](https://arxiv.org/html/2606.09877#S4.SS3)extends it to a state\-aware variant that conditions on the current pin set\.

### 4\.3State\-Aware Marginal Regret Scorer

The stateless volatility scorers​\(c\)s\(c\)\(§[4\.2](https://arxiv.org/html/2606.09877#S4.SS2)\) evaluates each candidate fact independently of the current pin set𝒫t\\mathcal\{P\}\_\{t\}\. This creates redundancy: if the pin budget already contains a highly relevant article about NVDA’s supply chain disruption, a second article covering the same event contributes little additional wiki quality—yet both receive equal scores from the stateless scorer\.

We address this by replacings​\(c\)s\(c\)with a*marginal expected regret reduction*: the gain in wiki quality from pinning candidateccgiven what is already in𝒫t\\mathcal\{P\}\_\{t\}\. Define the analytical training target:

Δ​R​\(c∣𝒫t\)=AVR​\(c\)×\(1−maxp∈𝒫t⁡\|ρ​\(rc,rp\)\|\)\\Delta R\(c\\mid\\mathcal\{P\}\_\{t\}\)=\\text\{AVR\}\(c\)\\times\\bigl\(1\-\\max\_\{p\\in\\mathcal\{P\}\_\{t\}\}\|\\rho\(r\_\{c\},r\_\{p\}\)\|\\bigr\)\(5\)whereρ​\(rc,rp\)\\rho\(r\_\{c\},r\_\{p\}\)is the Pearson correlation between the 5\-day forward return series of the candidate’s and pin’s associated entities\. The first factor rewards high\-volatility candidates; the second penalises candidates whose market impact is already captured by a pinned article \(high\|ρ\|\|\\rho\|= correlated market movements = redundant information\)\. When𝒫t=∅\\mathcal\{P\}\_\{t\}=\\emptyset,Δ​R​\(c∣∅\)=AVR​\(c\)\\Delta R\(c\\mid\\emptyset\)=\\text\{AVR\}\(c\), recovering the stateless scorer as a special case\.

##### Architecture: frozen backbone \+ regression head\.

We estimateΔ^​Rθ​\(c∣𝒫t\)\\hat\{\\Delta\}R\_\{\\theta\}\(c\\mid\\mathcal\{P\}\_\{t\}\)using the same frozen\-backbone paradigm as the stateless scorer \(§[4\.2](https://arxiv.org/html/2606.09877#S4.SS2)\): the same Llama 3\.1 8B weights are used for both heads, and no gradient ever flows into the LLM\. The key difference is a richer input prompt that jointly encodes the current pin set and the candidate fact, enabling the backbone’s last\-token representation to integrate redundancy information:

Youareevaluatingwhethertoaddanewsitemtoa

financialknowledgewiki\.

Currentwikipins\(highestpriorityfirst\):

\-\{ticker\_1\}:\{headline\_1\}

\-\{ticker\_2\}:\{headline\_2\}\.\.\.

Candidatefact:

\-\{ticker\_c\}:\{headline\_c\}

RatetheMARGINALVALUE\(0to1\)ofaddingthiscandidate

giventheexistingpins\.Ahighscoremeans:highvolatility

potentialANDcoversentities/eventsnotalreadypinned\.

Alowscoremeans:theentity/eventisalreadywell\-

representedinthecurrentpins\.

\[SCORE\]

The frozen LLM produces a hidden stateh\[Score\]∈ℝdh\_\{\[\\textsc\{Score\}\]\}\\in\\mathbb\{R\}^\{d\}for the final\[SCORE\]token\. A linear regression head,

Δ^​Rθ​\(c∣𝒫t\)=σ​\(w⊤​h\[Score\]\+b\),w∈ℝd,b∈ℝ,\\hat\{\\Delta\}R\_\{\\theta\}\(c\\mid\\mathcal\{P\}\_\{t\}\)=\\sigma\\\!\\left\(w^\{\\top\}h\_\{\[\\textsc\{Score\}\]\}\+b\\right\),\\quad w\\in\\mathbb\{R\}^\{d\},\\;b\\in\\mathbb\{R\},\(6\)is the*only*component trained; all backbone parameters remain frozen\. This mirrors the stateless classification head exactly, except \(i\) the input now encodes context from𝒫t\\mathcal\{P\}\_\{t\}, and \(ii\) the loss is regression rather than classification\. Training minimises MSE:

ℒ​\(θ\)=1\|𝒮\|​∑\(c,𝒫t,Δ​R\)∈𝒮\[Δ^​Rθ​\(c∣𝒫t\)−Δ​R​\(c∣𝒫t\)\]2\\mathcal\{L\}\(\\theta\)=\\frac\{1\}\{\|\\mathcal\{S\}\|\}\\sum\_\{\(c,\\,\\mathcal\{P\}\_\{t\},\\,\\Delta R\)\\in\\mathcal\{S\}\}\\bigl\[\\hat\{\\Delta\}R\_\{\\theta\}\(c\\mid\\mathcal\{P\}\_\{t\}\)\-\\Delta R\(c\\mid\\mathcal\{P\}\_\{t\}\)\\bigr\]^\{2\}\(7\)where𝒮\\mathcal\{S\}is a set of \(candidate, pin set, target\) triples constructed analytically from the existing price and materiality label data—no additional LLM inference is required beyond the single extraction pass\. The backbone’s last\-token hidden states are extracted once per \(prompt, pin set\) pair and cached to disk; all subsequent training epochs operate entirely on the cacheddd\-dimensional vectors and complete in seconds on CPU\. In our experiments, 3,000 triples drawn from the 76K\-article corpus suffice for convergence with a single linear layer \(d=4096d=4096for Llama 3\.1 8B\)\.

##### Greedy pin selection\.

At each time step, instead of pushing all candidates above thresholdτ\\tauinto the queue, we select greedily:

c∗=arg⁡maxc∈𝒞t⁡Δ^​Rθ​\(c∣𝒫t\)c^\{\*\}=\\arg\\max\_\{c\\in\\mathcal\{C\}\_\{t\}\}\\hat\{\\Delta\}R\_\{\\theta\}\(c\\mid\\mathcal\{P\}\_\{t\}\)\(8\)and repeat until the token budgetBpinB\_\{\\text\{pin\}\}is exhausted\. Proposition[5\.4](https://arxiv.org/html/2606.09877#S5.Thmtheorem4)\(see §[5](https://arxiv.org/html/2606.09877#S5)\) establishes that this greedy rule achieves a\(1−1/e\)\(1\-1/e\)approximation to the optimal pin set when wiki quality is submodular\.

### 4\.4Pin Eviction

Active pins compete for the limited budgetBpinB\_\{\\text\{pin\}\}\. We manage eviction via a decay\-weighted priority queue\. The priority of pinppat timettis:

priority​\(p,t\)=Δ^​Rθ​\(p∣𝒫t∖\{p\}\)⋅exp⁡\(−λ⋅\(t−tp\)\)\\text\{priority\}\(p,t\)=\\hat\{\\Delta\}R\_\{\\theta\}\(p\\mid\\mathcal\{P\}\_\{t\}\\setminus\\\{p\\\}\)\\cdot\\exp\\\!\\left\(\-\\lambda\\cdot\(t\-t\_\{p\}\)\\right\)\(9\)whereΔ^​Rθ​\(p∣𝒫t∖\{p\}\)\\hat\{\\Delta\}R\_\{\\theta\}\(p\\mid\\mathcal\{P\}\_\{t\}\\setminus\\\{p\\\}\)is the marginal regret score of pinppgiven all other currently pinned articles,tpt\_\{p\}is the pin’s creation time, andλ\>0\\lambda\>0controls the staleness decay rate\. Using the marginal score rather than a statelesss​\(p\)s\(p\)ensures that eviction preferentially removes whichever pin is most*redundant*relative to the remaining pin set, not merely which has the lowest individual score\.

ThePinSelectprocedure \(i\) updates priorities for all active pins, \(ii\) inserts new candidates from the queue, and \(iii\) evicts the lowest\-priority pins until the budget constraint is satisfied\. This reduces to a*weighted online paging*problem\(Fiatet al\.,[1991](https://arxiv.org/html/2606.09877#bib.bib46); Bansalet al\.,[2012](https://arxiv.org/html/2606.09877#bib.bib47)\)where page weights correspond to materiality scores and the cache size corresponds to the pin budget\.

### 4\.5Incremental vs\. Full Recompilation

##### Incremental compile\.

At each time step,IncrementalCompilepatches the existing wiki by \(i\) appending pinned facts to the relevant entity sections and \(ii\) recompiling the full KV prefix cache for each affected entity\. This costsO​\(\|𝒫tnew\|⋅\|prefixk\|\)O\(\|\\mathcal\{P\}\_\{t\}^\{\\text\{new\}\}\|\\cdot\|\\text\{prefix\}\_\{k\}\|\)for KV cache recompilation of the affected entitieskk, avoiding a full corpus recompilation\.

##### Full recompile\.

EveryTrT\_\{r\}steps, a fullBatchWiCERrecompilation runs on the accumulated corpus\. Pinned facts are treated as hard constraints \(they must appear in the output\), consistent with the original WiCER pinning mechanism\(Huerta,[2026](https://arxiv.org/html/2606.09877#bib.bib1)\)\. This ensures that incrementally added facts are properly integrated into the wiki’s narrative structure\.

##### BatchWiCER subroutine \(brief description\)\.

BatchWiCER\(Huerta,[2026](https://arxiv.org/html/2606.09877#bib.bib1)\)takes a corpus of articles𝒟\\mathcal\{D\}and token budgetBBand produces a compiled wikiWW\. Concretely: \(1\) an LLM extracts named facts per entity via a single forward pass per article; \(2\) facts are ranked by TF\-IDF relevance to the entity; \(3\) facts are greedily added until the budget is exhausted, with pinned facts included first as hard constraints; \(4\) a final LLM pass compiles selected facts into a coherent, encyclopedia\-style wiki section; \(5\) the resulting text is stored as a pre\-filled KV\-cache prefix for O\(1\) context injection at query time\. The compilation cost isO​\(\|𝒟\|⋅\|K\|\)O\(\|\\mathcal\{D\}\|\\cdot\|K\|\)LLM calls, dominated by fact extraction in step \(1\)\.

### 4\.6Volatility\-Driven Query Generation

##### Motivation\.

The formulation in §[3](https://arxiv.org/html/2606.09877#S3)treatsQtQ\_\{t\}as exogenous—queries arrive and are served against the current wiki, but the wiki management decisions are made without observing them\. In the experiments of §[7\.1\.8](https://arxiv.org/html/2606.09877#S7.SS1.SSS8)we instantiatedQtQ\_\{t\}as static quarterly templates, which creates a structural mismatch: pinning is driven by a dynamic, event\-specific signal \(predicted abnormal volatility\), while evaluation is driven by generic category questions whose gold answers do not change with the news stream\.

The fix is to make part ofQtQ\_\{t\}*endogenous*: the same high\-volatility events that trigger pinning should also trigger the generation of queries about those events\. This creates a self\-consistent evaluation loop—the wiki is judged on exactly the questions that motivated building it\.

##### Query set decomposition\.

We decomposeQtQ\_\{t\}into two components: The decompositionQt=Qtbg∪QtvolQ\_\{t\}=Q\_\{t\}^\{\\textup\{bg\}\}\\cup Q\_\{t\}^\{\\textup\{vol\}\}is defined in Equation \([2](https://arxiv.org/html/2606.09877#S3.E2)\) \(§[3](https://arxiv.org/html/2606.09877#S3)\)\.QtbgQ\_\{t\}^\{\\textup\{bg\}\}contains standing analyst questions loaded at initialisation and refreshed at coarse intervals\.QtvolQ\_\{t\}^\{\\textup\{vol\}\}contains event\-driven questions generated each step from newly pinned facts:

Qtvol=QueryGen​\(Δ​𝒫t\)Q\_\{t\}^\{\\textup\{vol\}\}=\\textsc\{QueryGen\}\(\\Delta\\mathcal\{P\}\_\{t\}\)\(10\)

##### QueryGen subroutine\.

For each newly pinned fact\(f,k\)∈Δ​𝒫t\(f,k\)\\in\\Delta\\mathcal\{P\}\_\{t\},QueryGencalls the frozen LLM once to produce a specific question and anchors its gold answer to the extracted fact:

Youareafinancialanalyst\.Giventhefollowing

extractedfactabout\{ticker\},generateonespecific

questionthataportfolioanalystwouldaskto

verifyorelaborateonthisdevelopment\.

Thequestionmustbeanswerablefromthefactitself\.

Bespecific:nametheevent,notthecategory\.

Fact:\{fact\}

Question:

The gold answera∗a^\{\*\}for the generated question is the extracted factffitself \(with optional LLM paraphrase for fluency\)\. This anchors evaluation to the information stream: a query about an NVDA earnings surprise is generated*only*when such a surprise is pinned, and its gold answer is grounded in the pinned fact rather than a generic headline list\.

##### Self\-consistency\.

With endogenousQtvolQ\_\{t\}^\{\\textup\{vol\}\}, the system achieves a closed loop between pinning and evaluation: a perfect scorer \(ε→0\\varepsilon\\to 0\) pins exactly the high\-volatility facts, generates queries about those facts, and the wiki—which contains those facts—answers them correctly, driving regret to zero\. A random scorer \(ε→0\.5\\varepsilon\\to 0\.5\) generates uncorrelated queries and pins, yielding no systematic quality improvement over a FIFO baseline\. The prediction errorε\\varepsilonthus governs*both*sides of the regret bound \(Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\): it bounds the fraction of material facts missed by the pinner, and equivalently, the fraction of event\-driven queries for which the wiki lacks the required context\.

##### Relation to standing queries\.

QtbgQ\_\{t\}^\{\\textup\{bg\}\}captures analyst needs that are*not*triggered by any single news event—portfolio\-level questions, sector comparisons, long\-horizon trend analyses\. These are best served by the base compiled wiki rather than the pin layer\. TreatingQtbgQ\_\{t\}^\{\\textup\{bg\}\}andQtvolQ\_\{t\}^\{\\textup\{vol\}\}as separate components naturally decomposes wiki quality into two regimes: the pin layer is evaluated againstQtvolQ\_\{t\}^\{\\textup\{vol\}\}\(did we capture the right events?\), and the base wiki is evaluated againstQtbgQ\_\{t\}^\{\\textup\{bg\}\}\(is the background knowledge current and comprehensive?\)\.

### 4\.7Predictive CEGAR Formalization

Table[1](https://arxiv.org/html/2606.09877#S4.T1)formalizes the correspondence between classical CEGAR, batch WiCER, and Online WiCER\. The key shift is from*reactive*counterexample discovery to*proactive*prediction: in Online WiCER, the “counterexample” is a high\-volatility news item predicted to cause future knowledge gaps, rather than an observed failure\.

Table 1:CEGAR↔\\leftrightarrowWiCER↔\\leftrightarrowOnline WiCER correspondence\.This predictive CEGAR paradigm generalizes beyond finance: in any domain where “counterexamples” \(knowledge gaps\) can be*predicted*rather than merely*observed*, proactive refinement can reduce the latency between a knowledge gap’s emergence and its resolution\.

## 5Theoretical Analysis

We analyze Online WiCER’s regret in terms of two sources of loss: \(1\) eviction regret from limited pin budget, and \(2\) prediction error from imperfect volatility scoring\.

###### Theorem 5\.1\(Regret decomposition\)\.

LetTTbe the time horizon,KKthe number of entities,BpinB\_\{\\text\{pin\}\}the pin budget \(in tokens\), andε=Pr⁡\[\|s^​\(k,n\)−𝒱​\(k,n\)\|\>δ\]\\varepsilon=\\Pr\[\|\\hat\{s\}\(k,n\)\-\\mathcal\{V\}\(k,n\)\|\>\\delta\]the volatility prediction error rate\. Under the multiplicative\-weights pin selection policy, the cumulative regret of Online WiCER satisfies:

Regret​\(T\)≤O​\(T​log⁡K\)⏟eviction regret\+O​\(Bmiss⋅T⋅ε\)⏟prediction error\\textup\{Regret\}\(T\)\\leq\\underbrace\{O\\\!\\left\(\\sqrt\{T\\log K\}\\right\)\}\_\{\\textup\{eviction regret\}\}\+\\underbrace\{O\\\!\\left\(B\_\{\\textup\{miss\}\}\\cdot T\\cdot\\varepsilon\\right\)\}\_\{\\textup\{prediction error\}\}\(11\)whereBmissB\_\{\\textup\{miss\}\}is the maximum quality loss from a single missing material fact\.

###### Proof sketch\.

The eviction component follows from the weighted experts framework\(Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2606.09877#bib.bib42)\): treating each potential pin as an “expert” and the quality loss from eviction as the loss function, multiplicative weights achievesO​\(T​log⁡Npins\)O\(\\sqrt\{T\\log N\_\{\\text\{pins\}\}\}\)regret, whereNpins≤K⋅\|𝒩t\|maxN\_\{\\text\{pins\}\}\\leq K\\cdot\|\\mathcal\{N\}\_\{t\}\|\_\{\\max\}\. Since at mostKKentities are relevant per step, this simplifies toO​\(T​log⁡K\)O\(\\sqrt\{T\\log K\}\)\.

The prediction error component bounds the expected number of material facts that are*missed*\(scored below thresholdτ\\tau\) or*spuriously pinned*\(scored aboveτ\\taubut not material\)\. Each missed material fact incurs at mostBmissB\_\{\\text\{miss\}\}quality loss\. With prediction error rateε\\varepsilon, the expected number of misses per step is at most\|𝒩t\|⋅ε\|\\mathcal\{N\}\_\{t\}\|\\cdot\\varepsilon, yielding theO​\(Bmiss⋅T⋅ε\)O\(B\_\{\\text\{miss\}\}\\cdot T\\cdot\\varepsilon\)term\. A full proof appears in Appendix[A](https://arxiv.org/html/2606.09877#A1)\. ∎

###### Theorem 5\.2\(Periodic recompilation convergence\)\.

If the fullBatchWiCERrecompilation runs everyTrT\_\{r\}steps, then within each recompilation window\[t,t\+Tr\]\[t,t\+T\_\{r\}\], the wiki quality satisfies:

𝒬​\(Wt\+Tr,q\)≥𝒬​\(Wtbatch,q\)−O​\(Tr⋅\|𝒩t\|avg⋅εBpin\)\\mathcal\{Q\}\(W\_\{t\+T\_\{r\}\},q\)\\geq\\mathcal\{Q\}\(W\_\{t\}^\{\\textup\{batch\}\},q\)\-O\\\!\\left\(\\frac\{T\_\{r\}\\cdot\|\\mathcal\{N\}\_\{t\}\|\_\{\\textup\{avg\}\}\\cdot\\varepsilon\}\{B\_\{\\textup\{pin\}\}\}\\right\)\(12\)whereWtbatchW\_\{t\}^\{\\textup\{batch\}\}is the quality achievable by a full batch WiCER recompile at timett\. Moreover, each full recompilation converges in at most 2 iterations, inheriting the convergence guarantee of the original WiCER algorithm\.

###### Proposition 5\.3\(Vanishing quality gap\)\.

Under the Online WiCER algorithm with periodic recompilation, the time\-averaged quality gap between the online and oracle wikis vanishes:

1T​∑t=1T\[𝒬​\(𝒲t∗,qt\)−𝒬​\(𝒲t,qt\)\]→T→∞0\\frac\{1\}\{T\}\\sum\_\{t=1\}^\{T\}\\left\[\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\}^\{\*\},q\_\{t\}\)\-\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\},q\_\{t\}\)\\right\]\\xrightarrow\{T\\to\\infty\}0\(13\)provided the volatility prediction errorε\\varepsilonis bounded andTr=o​\(T\)T\_\{r\}=o\(T\)\.

###### Proposition 5\.4\(Submodular greedy approximation\)\.

Suppose wiki quality𝒬​\(𝒲t,q\)\\mathcal\{Q\}\(\\mathcal\{W\}\_\{t\},q\)is monotone submodular in the pin set𝒫t\\mathcal\{P\}\_\{t\}\. Then the greedy pin selection rule \(Equation \([8](https://arxiv.org/html/2606.09877#S4.E8)\)\) achieves a\(1−1/e\)≥0\.632\(1\-1/e\)\\geq 0\.632approximation to the optimal pin set of size at most⌊Bpin/f¯⌋\\lfloor B\_\{\\textup\{pin\}\}/\\bar\{f\}\\rfloorunder the token budgetBpinB\_\{\\textup\{pin\}\}, wheref¯\\bar\{f\}is the mean fact token length\.

###### Proof sketch\.

Monotone submodularity of𝒬\\mathcal\{Q\}implies that the marginal gainΔ​𝒬​\(c∣𝒫t\)=𝒬​\(𝒫t∪\{c\},q\)−𝒬​\(𝒫t,q\)\\Delta\\mathcal\{Q\}\(c\\mid\\mathcal\{P\}\_\{t\}\)=\\mathcal\{Q\}\(\\mathcal\{P\}\_\{t\}\\cup\\\{c\\\},q\)\-\\mathcal\{Q\}\(\\mathcal\{P\}\_\{t\},q\)is non\-negative and non\-increasing as𝒫t\\mathcal\{P\}\_\{t\}grows\. The greedy algorithm that at each step adds the element with largest marginal gain achieves a\(1−1/e\)\(1\-1/e\)factor of the optimal solution for any monotone submodular maximisation under a cardinality constraint\(Nemhauseret al\.,[1978](https://arxiv.org/html/2606.09877#bib.bib72)\)\. The regression\-head estimateΔ^​Rθ​\(c∣𝒫t\)\\hat\{\\Delta\}R\_\{\\theta\}\(c\\mid\\mathcal\{P\}\_\{t\}\)from Equations \([5](https://arxiv.org/html/2606.09877#S4.E5)\)–\([6](https://arxiv.org/html/2606.09877#S4.E6)\) serves as a proxy for the true marginal gain; the tightness of the\(1−1/e\)\(1\-1/e\)bound depends on how closelyΔ^​Rθ\\hat\{\\Delta\}R\_\{\\theta\}tracks the true marginal gainΔ​𝒬​\(c∣𝒫t\)\\Delta\\mathcal\{Q\}\(c\\mid\\mathcal\{P\}\_\{t\}\)\. ∎

The submodularity assumption holds approximately in practice: the marginal value of pinning a second article covering the same entity event \(high\|ρ\|\|\\rho\|\) is demonstrably lower than the first, satisfying the diminishing returns condition\.

##### Complexity\.

Per time step, Online WiCER performs: \(i\) fact extraction inO​\(\|𝒩t\|\)O\(\|\\mathcal\{N\}\_\{t\}\|\)LLM calls, \(ii\) backbone forward passes inO​\(\|𝒩t\|\)O\(\|\\mathcal\{N\}\_\{t\}\|\)to produce cached hidden states for all candidates, \(iii\) marginal scoring via two dot products per candidate—one for the stateless head and one for the state\-aware regression head—inO​\(\|𝒩t\|⋅d\)O\(\|\\mathcal\{N\}\_\{t\}\|\\cdot d\)whereddis the hidden dimension, \(iv\) greedy pin selection requiringO​\(\|𝒩t\|⋅\|𝒫t\|\)O\(\|\\mathcal\{N\}\_\{t\}\|\\cdot\|\\mathcal\{P\}\_\{t\}\|\)priority queue updates as the pin set grows, and \(v\) incremental compilation inO​\(\|𝒫tnew\|\)O\(\|\\mathcal\{P\}\_\{t\}^\{\\text\{new\}\}\|\)LLM calls\. The state\-aware scorer re\-evaluates all candidates after each greedy insertion step \(to update marginal gains\), but since each evaluation is a single dot product on cached vectors, the total cost isO​\(\|𝒩t\|2⋅d/Bpin\)O\(\|\\mathcal\{N\}\_\{t\}\|^\{2\}\\cdot d/B\_\{\\text\{pin\}\}\)in the worst case—negligible compared to the LLM forward passes\. The full recompilation everyTrT\_\{r\}steps costsO​\(B\)O\(B\)LLM calls, amortized toO​\(B/Tr\)O\(B/T\_\{r\}\)per step\.

## 6Experiments

We present two instantiations of the Streaming Knowledge Compilation framework, each using a different materiality signal and domain corpus\. The finance instantiation \(§[6\.1](https://arxiv.org/html/2606.09877#S6.SS1)\) uses abnormal stock volatility as the signal; the Wikipedia instantiation \(§[6\.2](https://arxiv.org/html/2606.09877#S6.SS2)\) uses the Abnormal Edit Ratio \(AER\)\. Both share identical algorithm, regret analysis, and evaluation protocol; only the materiality signal and corpus differ\.

### 6\.1Finance Instantiation

#### 6\.1\.1Data and Setup

##### News corpus\.

We collect 76,130 financial news articles from Finnhub for the top 39 S&P 500 companies by market capitalization, covering the period March 2025 – May 2026\. Articles are tagged with referenced tickers via entity recognition and keyword matching\. The corpus spans all major sectors including technology \(AAPL, MSFT, NVDA, GOOGL\), healthcare \(UNH, JNJ, LLY\), financials \(JPM, V, MA\), and energy \(XOM, CVX\)\.

##### Price data and volatility labels\.

Daily OHLCV data from Yahoo Finance for the same 39 entities \(March 2025 – May 2026\)\. We compute daily returns, rolling 60\-day volatility, and forward 5\-day realized volatility\. Volatility labels are assigned per Equation \([1](https://arxiv.org/html/2606.09877#S3.E1)\): a stock\-day is a high\-volatility event if its forward realized volatility exceeds2×2\\timesthe cross\-sectional market average\. The overall volatility event rate is 7\.2% in the aligned dataset, with substantial sector variation: Technology \(10\.8%\) and Consumer Cyclical \(13\.4%\) exhibit the highest rates, while defensive sectors like Consumer Staples \(0\.8%\) and Energy \(0\.0%\) rarely exhibit elevated relative volatility\. This class imbalance \(∼\\sim13:1 negative\-to\-positive ratio\) makes volatility prediction a challenging classification task\.

##### Train/validation/test split\.

We split the aligned dataset temporally: articles up to December 2025 for training \(45,672 articles,∼\\sim60%\), January–March 2026 for validation \(15,393,∼\\sim20%\), and after March 18, 2026 for testing \(15,020,∼\\sim20%\)\. This temporal split ensures no future information leakage\. The classification head is trained on the training set and all reported metrics are computed on the held\-out test set of 15,020 articles\.

##### QA benchmark construction\.

The end\-to\-end wiki evaluation \(§[7\.1\.8](https://arxiv.org/html/2606.09877#S7.SS1.SSS8)\) requires a time\-stamped QA benchmark that instantiates both query populations\.*Standing queries*\(QtbgQ\_\{t\}^\{\\textup\{bg\}\}\): for each of the 39 entities at each of 6 quarterly checkpoints \(2025\-Q1 through 2026\-Q2\) we generate one question per category \(earnings, strategy, risk, market, financial\), yielding 980 standing\-query pairs \(a subset of the theoretical maximum of 1,170, since some entities have sparse news at early checkpoints\)\. Question text is drawn from templates; gold answers are generated by the frozen Llama 3\.1 8B given the 90\-day lookback news window for that entity and checkpoint—one LLM call per \(entity, checkpoint, category\) triple, producing category\-specific answers rather than the same headline list for every question\.*Event\-driven queries*\(QtvolQ\_\{t\}^\{\\textup\{vol\}\}\): from the 5,295 high\-volatility articles in the aligned dataset \(AVR\>2\>2, 7\.0% event rate\), we sample 500 uniformly at random and invokeQueryGento produce one specific question per article, using the extracted fact as the gold answer\. These pairs test whether the pin layer—not the base wiki—contains the material information\. The two benchmarks are evaluated separately: standing\-query score measures base\-wiki quality; event\-query score measures pin\-layer quality\. The main QA experiment \(§[7\.1\.8](https://arxiv.org/html/2606.09877#S7.SS1.SSS8)\) draws a 200\-pair random sample from the combined benchmark \(approximate proportions:∼\\sim132 standing,∼\\sim68 event\-driven, reflecting the 66%/34% split of the full benchmark\)\.

##### Model and scoring configuration\.

We use Llama 3\.1 8B Instruct \(4\-bit quantized via MLX\) as the frozen backbone, running locally on Apple M4 with 24GB unified memory\. All heads \(classification and marginal regret regression\) are trained on top of the same frozen backbone weights\.

*Stateless classification head\.*A single linear projection from the 4096\-dimensional last hidden state to a scalar logit, followed by sigmoid activation\. Trained for 30 epochs with binary cross\-entropy loss, learning rate10−310^\{\-3\}, batch size 8, on a per\-article prompt of the form “Financial news about \{ticker\}: \{headline\}”\.

*State\-aware marginal regret regression head\.*A single linear projection from the same 4096\-dimensional last hidden state to a scalar output \(no sigmoid\), identical in structure to the classification head\. Input prompts encode both the candidate article and the current pin set \(up to 20 pins\) per the template in §[4\.3](https://arxiv.org/html/2606.09877#S4.SS3)\. Fit via Ridge regression \(α=1,000\\alpha=1\{,\}000, selected by validation MSE sweep\) on the MSE loss against the analytical target \(Equation \([5](https://arxiv.org/html/2606.09877#S4.E5)\)\)\. Training uses 3,000 \(candidate, pin set, target\) triples sampled from the training split; hidden states are extracted once and cached, so head fitting requires no further LLM forward passes and completes in under 5 seconds on CPU\. For the hybrid scorer, we setα=0\.4\\alpha=0\.4\.

#### 6\.1\.2Scorer Baselines

We compare three volatility scoring approaches, all built on the same frozen Llama 3\.1 8B backbone:

1. 1\.LLM zero\-shot: Prompt the model generatively to assess volatility signal \(no training data\)\.
2. 2\.Classification head: Trained linear head on frozen last hidden states \(supervised\)\.
3. 3\.Hybrid\(α=0\.4\\alpha=0\.4\): Weighted combination of zero\-shot and head scores per Equation \([4](https://arxiv.org/html/2606.09877#S4.E4)\)\.

Additionally, we compare against arandom baseline\(uniform\[0,1\]\[0,1\]scores\) and amajority\-class baseline\(always predict non\-material\)\.

#### 6\.1\.3Metrics

- •Materiality F1: Harmonic mean of precision and recall at the optimal threshold\.
- •Precision: Fraction of predicted\-material articles that are truly material\.
- •Recall: Fraction of truly material articles that are detected\.
- •AUROC: Area under the ROC curve \(threshold\-independent\)\.

We sweep classification thresholdsτ∈\{0\.1,0\.2,…,0\.8\}\\tau\\in\\\{0\.1,0\.2,\\ldots,0\.8\\\}and report the threshold achieving the best F1 for each scorer variant\.

#### 6\.1\.4Ablations

We ablate five dimensions: \(a\) hybrid mixing parameterα∈\{0\.0,0\.2,0\.4,0\.6,0\.8,1\.0\}\\alpha\\in\\\{0\.0,0\.2,0\.4,0\.6,0\.8,1\.0\\\}\(whereα=0\\alpha=0is head\-only andα=1\\alpha=1is zero\-shot\-only\), \(b\) AVR threshold∈\{1\.5,2\.0,2\.5,3\.0\}\\in\\\{1\.5,2\.0,2\.5,3\.0\\\}controlling the positive\-class definition, \(c\) training set size \(25%, 50%, 75%, 100% of training data\), \(d\) volatility window \(3\-day, 5\-day, 10\-day, 20\-day forward returns\), and \(e\)backbone architecture: Llama 3\.1 8B, Gemma 3 4B, and Gemma 3 12B, testing whether the frozen\-LLM \+ classification\-head approach generalises across model families and parameter scales\.

### 6\.2Wikipedia Instantiation

The Wikipedia instantiation replaces the financial corpus and volatility signal with a publicly available, continuously updated knowledge source\. This tests whether Streaming Knowledge Compilation generalizes beyond finance without any modification to the algorithm\.

##### Corpus\.

We use the Wikimedia revision API to collect edit histories for 25 entities across five categories: AI companies \(OpenAI, DeepMind, Anthropic, Nvidia, Meta Platforms\), AI topics \(large language models, AGI, ChatGPT, GPT\-4, AlphaFold\), science and technology \(CRISPR, quantum computing, the James Webb Space Telescope, nuclear fusion, mRNA vaccines\), geopolitics \(NATO, EU, BRICS, G20, UN\), and public figures \(Elon Musk, Sam Altman, Demis Hassabis, Yoshua Bengio, Geoffrey Hinton\)\. We collect revision metadata over a two\-year window \(2024–2025\), totalling approximately 50,000–200,000 revisions depending on entity\.

##### Materiality signal: Abnormal Edit Ratio \(AER\)\.

The AER is the direct analog of the finance AVR:

AER​\(k,t\)=EditVelocity​\(k,t,w\)EditVelocity¯​\(t,w\),ϕt​\(k,⋅\)=𝟏​\[AER​\(k,t\)\>2\]\\text\{AER\}\(k,t\)=\\frac\{\\text\{EditVelocity\}\(k,t,w\)\}\{\\overline\{\\text\{EditVelocity\}\}\(t,w\)\},\\quad\\phi\_\{t\}\(k,\\cdot\)=\\mathbf\{1\}\[\\text\{AER\}\(k,t\)\>2\]\(14\)whereEditVelocity​\(k,t,w\)\\text\{EditVelocity\}\(k,t,w\)is the edit count for entitykkin theww\-day trailing window \(defaultw=7w=7\), and the denominator is the cross\-sectional mean across all 25 entities\. AER\>2\>2indicates an entity is being edited at more than twice the average rate—a reliable proxy for a breaking development\.

##### Content extraction\.

For high\-AER windows, we fetch the full wikitext of the triggering revision and its parent, compute character\-level diffs to extract newly added sentences \(via difflib\), and pass additions to the LLM for fact extraction\. The resulting \(entity, fact, AER score, date\) tuples serve as the pin candidates, directly analogous to the \(ticker, headline, AVR score, date\) tuples in the finance instantiation\.

##### QA benchmark\.

Standing queries \(QtbgQ\_\{t\}^\{\\textup\{bg\}\}\) are generated from the earliest snapshot of each entity \(3 questions per entity, sampled from representative sentences\)\. Event\-driven queries \(QtvolQ\_\{t\}^\{\\textup\{vol\}\}\) are generated byQueryGenapplied to facts extracted from high\-AER edit windows\. The same LLM judge \(Llama 3\.1 8B\) is used as in the finance instantiation\.

##### Oracle\.

The oracle uses ground\-truth labels: an AER window is positive if and only if it produced an event\-driven QA pair\. This mirrors the finance oracle \(which uses ground\-truth AVR labels\) and provides the same matched\-pair regret comparison\.

Results for the Wikipedia instantiation are presented in §[7\.2](https://arxiv.org/html/2606.09877#S7.SS2)\.

## 7Results

### 7\.1Finance Results

#### 7\.1\.1Materiality Prediction

Table 2:Finance instantiation: materiality prediction performance on held\-out test set \(abnormal volatility, AVR\>2\>2\)\. Thresholdτ∗\\tau^\{\*\}is optimized on validation set by F1\.Table[2](https://arxiv.org/html/2606.09877#S7.T2)presents the core evaluation on a temporally held\-out test set \(all articles after March 18, 2026, with training data restricted to articles before January 2026\): can we predict which news articles will cause abnormal stock volatility relative to the broader market? The majority\-class baseline achieves the highest accuracy \(0\.928\) by predicting all\-negative, exploiting the 13:1 imbalance; its accuracy equals the negative\-class base rate \(1−\-0\.072\) and reflects no learning\. AUROC is the appropriate metric here: it measures the probability that the model ranks a random positive above a random negative, and is therefore insensitive to class imbalance and threshold choice\. Accuracy and AUROC diverge whenever imbalance is severe—the majority\-class classifier has AUROC≈0\.5\\approx 0\.5\(random ranking\) despite 92\.8% accuracy, while the classification head has AUROC = 0\.728 despite lower accuracy \(0\.688\) due to its low operating threshold \(τ∗=0\.1\\tau^\{\*\}=0\.1, chosen to maximise F1\)\. The classification head achieves AUROC = 0\.728 and F1 = 0\.212 \(precision 0\.130, recall 0\.583\), outperforming the zero\-shot baseline \(AUROC = 0\.541, F1 = 0\.157\)\. Crucially, this is a*linear probe*: no gradient ever flows into the backbone, so the result directly measures whether abnormal volatility signal is linearly decodable from pretrained LLM representations\. AUROC = 0\.728 under a strict temporal split and 13:1 class imbalance indicates that it is\. The moderate recall is important for the Online WiCER use case: missing a material event \(false negative\) causes a knowledge gap that persists until the next recompilation, while a false positive merely wastes a small fraction of the pin budget\.

The zero\-shot scorer produces poorly calibrated scores \(mean = 0\.442, std = 0\.276\) and achieves only 9\.2% precision at its best F1 threshold, indicating that generative prompting alone cannot reliably distinguish material from non\-material news for the volatility task\.

The hybrid scorer \(α=0\.4\\alpha=0\.4\) produces statistically indistinguishable results from the head\-only baseline \(AUROC = 0\.729 vs\. 0\.728, F1 identical at 0\.212\)\. This is expected: on our test set, only 4% of articles fall in the ambiguous band\[0\.3,0\.7\]\[0\.3,0\.7\]where zero\-shot would be invoked \(Table[10](https://arxiv.org/html/2606.09877#S7.T10)\), so the hybrid operates as head\-only for 96% of the stream\. The hybrid’s contribution is*operational, not accuracy\-based*: it provides a principled escape hatch for borderline articles at minimal latency cost \(113\.9 ms vs\. 94\.8 ms for head\-only\), and ensures graceful degradation when the head encounters event types absent from training data\.

#### 7\.1\.2Score Distribution Analysis

The classification head produces a bimodal score distribution \(mean = 0\.235, std = 0\.379\): non\-material articles cluster near 0, while material articles receive higher scores\. The zero\-shot scorer, by contrast, produces a broad, roughly uniform distribution \(mean = 0\.442, std = 0\.276\) with poor separation between classes\. This bimodality of the head’s output enables better separation despite the 13:1 class imbalance, as reflected in the substantially higher AUROC \(0\.728 vs\. 0\.541\)\.

#### 7\.1\.3Ablation Results

Table 3:Ablation study on volatility prediction \(test set, AUROC and F1 at optimal threshold\)\.Table 4:Decay parameterλ\\lambdaablation: material pin retention after 30\-day streaming simulation \(100\-pin budget, articles arrive uniformly over 30 days\)\. Higher retention = more genuinely material articles kept in wiki\.Table[3](https://arxiv.org/html/2606.09877#S7.T3)presents ablation results across five dimensions and Table[4](https://arxiv.org/html/2606.09877#S7.T4)shows the decay ablation:

Hybridα\\alpha\.The mixing parameterα\\alphahas negligible effect on AUROC \(0\.703 across all values from 0\.0 to 1\.0\), indicating that zero\-shot scores and head scores are nearly collinear in their ranking of articles\. The head already captures the discriminative information that zero\-shot responses encode—adding zero\-shot signal to the mixture does not open a new separation axis\. This is an important null result: it means the hybrid’s value is*not*accuracy but*latency*\. By routing only the ambiguous 4% of articles \(head score∈\[0\.3,0\.7\]\\in\[0\.3,0\.7\]\) to zero\-shot, the hybrid achieves statistically identical accuracy to head\-only while retaining 83% of head\-only throughput \(31,607 vs\. 37,992 art/hr; Table[10](https://arxiv.org/html/2606.09877#S7.T10)\)—a latency\-accuracy router, not an ensemble accuracy booster\.

AVR threshold\.Lowering the threshold to 1\.5 substantially increases recall \(more positives\) and yields the best F1 of 0\.492 at the cost of a noisier signal\. Higher thresholds \(2\.5, 3\.0\) yield higher AUROC \(0\.763–0\.786\) because the model more cleanly separates extreme volatility events, but F1 collapses as positives become sparse\. The default AVR=2\.0=2\.0balances label quality and volume for the Online WiCER pinning use case\.

Training data\.Remarkably, AUROC peaks at 25% training data \(AUROC = 0\.767, F1 = 0\.258\) and the full dataset \(AUROC = 0\.725\) slightly underperforms\. This saturation is consistent with the linear probe interpretation: the backbone’s representation space already encodes volatility signal as a near\-linear structure, so adding more data calibrates the hyperplane margin without discovering qualitatively new signal\.

Volatility window\.Longer windows yield substantially better AUROC: 0\.890 at 10\-day and 0\.903 at 20\-day, compared to 0\.732–0\.736 for 3–5 day windows\. However, AUROC alone is misleading here\. At 10 days, recall reaches 1\.000 at threshold 0\.2—the model has learned to predict*everything*as positive, since virtually all stocks exhibit measurable volatility over a 10\-day window\. This near\-trivial classifier inflates AUROC while collapsing precision \(0\.085\), giving F1 = 0\.157, substantially below the 5\-day F1 of 0\.230\. The 5\-day window maximizes F1 at the operating threshold, providing the most discriminative signal for pinning decisions\. Wider windows also reduce timeliness: a 20\-day label requires waiting three weeks post\-publication before the target is known, creating a significant lag in the training signal that is untenable for a real\-time streaming system\.

Backbone\.We compare three frozen backbone architectures—Llama 3\.1 8B \(4096\-d\), Gemma 3 4B \(2560\-d\), and Gemma 3 12B \(3840\-d\)—holding all other hyperparameters fixed\. The results show a clear hierarchy: Llama 3\.1 8B \(AUROC = 0\.716\) substantially outperforms Gemma 3 12B \(0\.631,−\-8\.5 points\) and Gemma 3 4B \(0\.578,−\-13\.8 points\)\. This finding has two implications\. First, the frozen\-LLM \+ linear\-probe paradigm is*not*backbone\-agnostic for financial materiality prediction: the quality of the pre\-trained representations varies significantly across model families, and model size alone does not explain the gap \(Gemma 3 12B has more parameters than Llama 3\.1 8B yet underperforms by a wide margin\)\. Second, Llama’s pre\-training—which emphasizes instruction following and general\-purpose reasoning on a broad web corpus—appears to encode financial event semantics more linearly than Gemma’s\. This is consistent withGuo and Hauptmann \([2024](https://arxiv.org/html/2606.09877#bib.bib50)\), who find that Llama\-family decoder LLMs outperform alternatives across large stock universes; our linear\-probe framing provides a mechanistic explanation: Llama representations are more linearly separable for financial relevance classification\. The backbone ablation uses a 500\-article random sample of the test set \(same seed across all backbone runs\) to reduce compute; the Llama 3\.1 8B figure here \(AUROC = 0\.716\) is slightly below the full 15,020\-article test\-set result \(0\.728 in Table[2](https://arxiv.org/html/2606.09877#S7.T2)\), consistent with expected sampling variance at this subset size\. All backbone models are compared on the identical 500\-article subset, so the relative ordering is unaffected\.

Decay parameterλ\\lambda\.Table[4](https://arxiv.org/html/2606.09877#S7.T4)shows material pin retention under different decay rates in a 30\-day streaming simulation \(100\-pin budget, articles arriving uniformly\)\. Without decay \(λ=0\\lambda=0\), the queue retains the top\-100 articles by initial score and material retention is 30\.6%\. With modest decay \(λ≥0\.1\\lambda\\geq 0\.1\), temporal weighting slightly improves both retention \(33\.3%\) and precision in queue \(12\.0%\), as the model places relatively higher priority on recent articles, which tend to be more immediately relevant\. The effect is modest in this simulation; the primary role ofλ\\lambdain practice is to prevent stale high\-score articles from permanently blocking budget for new events\. We setλ=0\.1\\lambda=0\.1as the default\.

Recompile periodTrT\_\{r\}\.Table[5](https://arxiv.org/html/2606.09877#S7.T5)shows material pin retention under varying recompile periods in the same 30\-day simulation \(λ=0\.1\\lambda=0\.1, 100\-pin budget\)\. Retention is flat at 33\.3% forTr∈\{1,5\}T\_\{r\}\\in\\\{1,5\\\}, rises modestly to 36\.1% forTr∈\{10,20\}T\_\{r\}\\in\\\{10,20\\\}, and reaches 41\.7% forTr=50T\_\{r\}=50\(no recompile triggered within the 30\-day window\)\. The result confirms that the decay eviction mechanism alone is sufficient for the proxy metric: theλ\\lambda\-weighted priority queue already deprioritizes stale pins without the cost of a full recompile\. Very frequent recompilation \(Tr=1T\_\{r\}=1\) offers no retention benefit because re\-ranking by raw score on each step produces the same top\-100 as decay\-weighted selection at this scale\. The practical implication is thatTrT\_\{r\}can be chosen to meet compute budget constraints—weekly \(Tr=7T\_\{r\}=7\) or bi\-weekly \(Tr=14T\_\{r\}=14\) schedules are reasonable defaults—without materially affecting pin quality\. Note that the proxy metric measures only which facts are*present*in the wiki, not how well they are*integrated*; the semantic quality benefit of full recompilation \(coherent, non\-redundant wiki prose\) is captured by the QA metric in Table[11](https://arxiv.org/html/2606.09877#S7.T11), not by retention alone\.

Table 5:Recompile periodTrT\_\{r\}ablation: material pin retention after 30\-day streaming simulation \(λ=0\.1\\lambda=0\.1, 100\-pin budget, articles arriving uniformly\)\.Tr=50T\_\{r\}=50acts as the no\-recompile baseline \(first recompile falls outside the 30\-day window\)\. Retention is insensitive toTrT\_\{r\}, confirming that decay eviction handles freshness independently of recompilation frequency\.
#### 7\.1\.4Error Analysis

The classification head’s primary failure mode is false positives: atτ∗=0\.1\\tau^\{\*\}=0\.1, precision is 0\.130, reflecting the difficulty of the temporal prediction task under strict data splitting\. However, the AUROC of 0\.728 confirms that the model’s continuous scores carry substantial discriminative information, and recall of 0\.583 indicates the head captures a majority of genuinely material events\. For the Online WiCER application, this asymmetry is acceptable: false positives consume pin budget \(a recoverable cost via eviction\), while false negatives create persistent knowledge gaps\.

The zero\-shot scorer exhibits a similar recall \(55\.6%\) but even lower precision \(9\.2%\) and substantially worse AUROC \(0\.541, barely above chance\)\. This suggests that the frozen LLM’s internal representations \(exploited by the classification head\) contain richer volatility\-relevant features than its generative output\.

#### 7\.1\.5Downstream Volatility Validation

Beyond classification metrics, we validate whether the scorer’s predicted scores monotonically track genuinely elevated forward realized volatility\. On a random 500\-article sample of the held\-out test set, the 162 articles predicted high\-volatility \(score≥τ∗=0\.1\\geq\\tau^\{\*\}=0\.1\) exhibit a mean 5\-day realized volatility of 2\.62%, compared to 1\.76% for the 338 predicted low\-volatility articles—a1\.49×\\timesvolatility ratio\. The predicted high\-volatility group also has a substantially higher mean abnormal volatility ratio \(AVR = 1\.32 vs\. 0\.91\), confirming that the model identifies articles associated with firm\-specific information events rather than broad market moves\.

##### Score calibration\.

To test whether the continuous scores are well\-calibrated—not merely good at the binary threshold—we group the 500 test articles into three confidence bands and measure mean 5\-day absolute forward return \(Table[6](https://arxiv.org/html/2606.09877#S7.T6)\)\.

Table 6:Score calibration: mean absolute 5\-day forward return by predicted\-score band \(500\-article test sample\)\. Higher predicted score corresponds to monotonically higher realized price movement\.Articles in the high\-confidence band exhibit2\.04×\\timesthe mean absolute 5\-day return of low\-confidence articles \(4\.78% vs\. 2\.35%\), and carry a 12% materiality rate versus 0% for the lowest\-scored articles\. The Spearman rank correlation between predicted score and\|\|ret\|\+5\{\}\_\{\+5\}\|isρ=0\.222\\rho=0\.222\(p=7×10−7p=7\\times 10^\{\-7\}\), confirming a statistically significant monotonic relationship that extends beyond the binary classification boundary\. This calibration result is operationally important: it means the continuous score can be used directly as a priority signal in the pin queue \(Equation \([9](https://arxiv.org/html/2606.09877#S4.E9)\)\)—articles pinned with higher scores are more likely to correspond to genuine volatility events, making the decay\-weighted eviction mechanism meaningful rather than arbitrary\.

Table[7](https://arxiv.org/html/2606.09877#S7.T7)shows sector\-level breakdown\. Communication Services \(AUROC = 0\.744,n=35n=35\) and Healthcare \(AUROC = 0\.735,n=86n=86\) show the strongest discrimination, while Financial Services shows below\-chance performance \(AUROC = 0\.449\), likely because financial\-sector news is inherently noisy and frequently sector\-correlated\. Consumer Defensive and Energy sectors have zero positive\-rate in the test window, precluding AUROC computation\.

Table 7:Sector\-level volatility prediction AUROC on the test set \(AVR threshold 2\.0\)\. Last column: mean absolute 5\-day forward return for articles predicted material \(score≥τ∗=0\.1\\geq\\tau^\{\*\}\{=\}0\.1\) vs\. predicted non\-material; “—” means no articles in that predicted bin for the sector\.

#### 7\.1\.6Marginal Regret Scorer Evaluation

We evaluate the state\-aware marginal regret regression head described in §[4\.3](https://arxiv.org/html/2606.09877#S4.SS3)on the 3,000\-triple dataset constructed from the training split\. After dropping 134 triples with missing forward return data \(articles too close to the dataset boundary\), 2,866 triples remain \(mean target=0\.092±0\.118=0\.092\\pm 0\.118\)\.

##### Ridge regression sweep\.

We fit a Ridge regression head on the cached 4,096\-dimensional Llama 3\.1 8B hidden states, sweeping the regularisation coefficientα\\alphaon a held\-out validation split \(10% of triples\)\. Table[8](https://arxiv.org/html/2606.09877#S7.T8)shows the full sweep; the predict\-mean MSE baseline isσy2=0\.0139\\sigma^\{2\}\_\{y\}=0\.0139\.

Table 8:Ridge regression sweep for marginal regret head\. Predict\-mean baseline MSE = 0\.0139\. Best atα=1,000\\alpha=1\{,\}000\(val MSE = 0\.0110,R2=0\.21R^\{2\}=0\.21\)\.The low\-α\\alpharegime shows near\-zero training MSE but high validation MSE \(0\.0300\.030\), confirming thatd=4,096\>n=2,866d=4\{,\}096\>n=2\{,\}866makes the underdetermined problem susceptible to overfitting without regularisation\. The optimalα=1,000\\alpha=1\{,\}000achieves val MSE =0\.0110\.011, yieldingR2=1−0\.011/0\.0139=0\.21R^\{2\}=1\-0\.011/0\.0139=0\.21relative to the predict\-mean baseline\. This confirms that the frozen backbone’s last\-token representation encodes approximately 21% of the variance in the marginal regret target—a meaningful but partial signal, consistent with the frozen linear probe’s inability to perfectly recover the return\-correlation novelty term\|ρ​\(rc,rp\)\|\|\\rho\(r\_\{c\},r\_\{p\}\)\|from text alone\.

##### Qualitative validation\.

Table[9](https://arxiv.org/html/2606.09877#S7.T9)shows the trained scorer applied to a synthetic candidate headline \(“NVDA misses revenue estimates by 30%, shares fall 8%”\) under four pin\-set contexts\. Scores are real outputs of the regression head on the frozen Llama 3\.1 8B backbone; the headline is synthetic to provide a controlled, interpretable test case\.

Table 9:State\-aware marginal regret scores for a synthetic NVDA candidate under varying pin sets\. Scores are real outputs of the trained regression head\. Lower score = candidate is redundant given what is already pinned\.The scorer assigns dramatically lower scores to NVDA\-redundant contexts \(0\.010–0\.051\) than to contexts where NVDA is absent \(0\.142–0\.152\), a15×\\timesratiobetween the strongest redundancy and the baseline\. The unrelated \(AAPL\) and empty contexts score nearly identically, confirming that cross\-entity pins do not artificially suppress the candidate’s marginal value\. This is precisely the behaviour required for the greedy selection rule \(Equation \([8](https://arxiv.org/html/2606.09877#S4.E8)\)\): when two similarly volatile articles arrive about the same entity, the second is scored an order of magnitude lower, freeing pin budget for coverage of other entities\.

##### Limitation\.

TheR2=0\.21R^\{2\}=0\.21confirms a useful but imperfect signal\. The frozen linear probe cannot fully recover the return\-correlation novelty term from text alone, asρ​\(rc,rp\)\\rho\(r\_\{c\},r\_\{p\}\)is a latent price\-series quantity not directly observable from headlines\. A non\-linear head or lightweight fine\-tuning of the backbone \(e\.g\. LoRA\) is expected to improve the fit; we leave this to future work\.

##### Trading signal caveat\.

A long\-predicted\-volatile strategy achieves mean 5\-day return of 0\.043% \(annualised Sharpe = 0\.049\), compared to buy\-and\-hold at 0\.194% \(Sharpe = 0\.283\)\. This*underperformance*is expected and desirable: the model predicts volatility*magnitude*, not price*direction*\. Abnormal volatility is equally likely to be caused by good or bad news, so a long\-only position on predicted\-volatile articles should not systematically outperform\. The appropriate use of the scorer is as a*pinning signal*for knowledge curation, not as a directional trading signal\.

#### 7\.1\.7Computational Efficiency and Stream Capacity

Throughput is a first\-class concern for a streaming system: the scorer must process incoming articles*faster*than they arrive, or it becomes the bottleneck that defeats the purpose of proactive pinning\. We benchmark all four scoring paths on Apple M4 \(24GB unified memory\) using 4\-bit quantized Llama 3\.1 8B via MLX, measuring wall\-clock latency over 50 held\-out articles after 5 warm\-up passes \(Table[10](https://arxiv.org/html/2606.09877#S7.T10)\)\. We contextualise against the Finnhub S&P 500 feed:∼\\sim120 articles/hour at steady state and∼\\sim500 articles/hour at peak \(earnings season\)\.

Table 10:Scorer latency and stream capacity on Apple M4 24GB \(Llama 3\.1 8B, 4\-bit\)\. Peak stream:∼\\sim500 articles/hour \(earnings season\)\.*Capacity ratio*= throughput / peak stream rate\.The classification head path \(frozen backbone forward pass \+ single dot product\) matches the hidden\-state\-only baseline almost exactly \(94\.8 ms vs\. 94\.5 ms mean\), confirming that the linear dot product adds negligible overhead\. Both easily exceed the peak stream rate by a factor of76×76\\times\. The zero\-shot path is4\.0×4\.0\\timesslower \(456\.9 ms mean\) due to autoregressive token generation, though it still provides15\.8×15\.8\\timesheadroom over the peak stream\. The hybrid design limits zero\-shot invocations to articles whose head score falls in the ambiguous band\[0\.3,0\.7\]\[0\.3,0\.7\]; on our benchmark, only 4% of articles are ambiguous, so the hybrid path reduces mean latency to 113\.9 ms—recovering 83% of the head\-only throughput \(31,607 vs\. 37,992 art/hr\) while inheriting the accuracy gains of zero\-shot on hard cases\. Training the classification head on 5,000 samples \(feature extraction \+ 20 epochs of SGD\) completes in under 15 minutes on the same hardware\.

#### 7\.1\.8End\-to\-End Wiki QA Evaluation

The preceding sections establish that the volatility scorer is calibrated and the system is computationally viable\. We now ask the central question:*does a materiality\-scored wiki actually improve downstream QA quality, and does the improvement concentrate on the right query population?*

##### Setup\.

We evaluate five wiki\-management strategies using the QA benchmark described in §[6](https://arxiv.org/html/2606.09877#S6), drawing a 200\-pair random sample from the combined benchmark \(approximately 132 standing, 68 event\-driven, reflecting the 66%/34% composition of the full 1,480\-pair dataset\)\. An LLM judge \(Llama 3\.1 8B\) rates each answer on a 1–5 scale against gold answers generated from category\-specific context \(standing\) or the extracted fact \(event\-driven\)\. Results are reported separately forQtbgQ\_\{t\}^\{\\textup\{bg\}\}\(standing queries; tests base\-wiki quality\) andQtvolQ\_\{t\}^\{\\textup\{vol\}\}\(event\-driven queries; tests pin\-layer quality\), and aggregated overall\. Strategies:*No Wiki*\(direct LLM\),*Static Wiki*\(compiled once from first 60 days, never updated\),*FIFO*\(all articles pinned, oldest evicted\),*Online WiCER*\(τ∗=0\.5\\tau^\{\*\}\{=\}0\.5,λ=0\.1\\lambda\{=\}0\.1\), and*Oracle*\(ground\-truth AVR labels\)\.

Table 11:End\-to\-end QA quality by wiki strategy and query population \(200 pairs, LLM\-as\-judge 1–5\)\.QtbgQ\_\{t\}^\{\\textup\{bg\}\}: standing queries \(base\-wiki quality\)\.QtvolQ\_\{t\}^\{\\textup\{vol\}\}: event\-driven queries \(pin\-layer quality; gold answers are post\-training facts outside the model’s parametric memory\)\. No Wiki scores highest on both populations due to the LLM\-as\-judge confound \(see §[7\.1\.8](https://arxiv.org/html/2606.09877#S7.SS1.SSS8)\)\. The regret analysis \(Table[12](https://arxiv.org/html/2606.09877#S7.T12)\) provides the clean WiCER\-vs\-oracle comparison\.
##### Findings\.

Table[11](https://arxiv.org/html/2606.09877#S7.T11)reveals a pervasive LLM\-as\-judge confound that affects both query populations\.

*Standing queries \(Qt*bg*Q\_\{t\}^\{\\textup\{bg\}\}\)\.*All strategies score within 0\.07 points of each other \(3\.80–3\.87\)\. No Wiki is marginally highest \(3\.87\), consistent with the expected confound: standing questions ask about persistent financial concerns that are largely answerable from the model’s parametric memory, and the judge—the same backbone—rates confident parametric responses highly\.

*Event\-driven queries \(Qt*vol*Q\_\{t\}^\{\\textup\{vol\}\}\)\.*The confound extends, unexpectedly, to event\-driven queries as well\. No Wiki scores highest \(3\.97\), followed by Static Wiki \(3\.83\), Online WiCER \(3\.71\), FIFO \(3\.56\), and Oracle \(3\.45\)\. The gold answers forQtvolQ\_\{t\}^\{\\textup\{vol\}\}are post\-training facts extracted from 2025–2026 news, which the backbone has not seen; nevertheless, the judge rates No Wiki’s confident \(but likely factually incorrect\) parametric responses more highly than wiki\-grounded answers\. This occurs because the LLM judge evaluates surface fluency and apparent confidence rather than factual accuracy against specific post\-training events\. Notably, Oracle—which pins the most material content—scores*lowest*: dense pinning clutters context with specific facts the judge cannot verify, reducing scores relative to a fluent parametric baseline\. This finding underscores that LLM\-as\-judge is not a reliable evaluation surface for knowledge that post\-dates the backbone’s training cutoff\.

*Regret analysis\.*Because the absolute QA scores are confounded, we turn to the regret analysis as the primary empirical result\. Table[12](https://arxiv.org/html/2606.09877#S7.T12)tracks cumulative regret on matched QA pairs across Online WiCER and Oracle runs \(173 matched pairs total\)\. Regret is defined as∑t\[𝒬​\(Wt∗,qt\)−𝒬​\(Wt,qt\)\]\\sum\_\{t\}\[\\mathcal\{Q\}\(W^\{\*\}\_\{t\},q\_\{t\}\)\-\\mathcal\{Q\}\(W\_\{t\},q\_\{t\}\)\]; a negative value means Online WiCER outperforms the oracle\.

Table 12:Cumulative regret of Online WiCER vs\. oracle on 173 matched QA pairs \(representative steps shown\)\. Regret==oracle score−\-WiCER score; a negative total means WiCER scores higher than the oracle under this judge\. Total cumulative regret=−20\.0=\-20\.0; mean per\-step regret=−0\.12=\-0\.12\. See text for confound interpretation\.The trajectory begins positive: cumulative regret reaches\+3\.0\+3\.0at step 50, where the oracle’s ground\-truth AVR labels give it an early advantage on clear\-cut high\-volatility events\. After step 50 the trajectory reverses and settles at−20\.0\-20\.0by step 173, meaning WiCER scores 0\.12 points*higher*than the oracle on average under this judge\.

This sign reversal is diagnostic of the LLM\-as\-judge confound identified in §LABEL:sec:qa\_finance, not evidence of algorithmic superiority\. The oracle’s high\-AVR pins are genuinely material, but they constitute dense post\-training facts that the backbone judge cannot verify; dense pinning therefore depresses oracle scores relative to WiCER’s sparser, less cluttered context\. Stated precisely: the oracle is optimal for volatility\-based curation but is*not*optimal for the reward the judge measures—so negative regret here indicates a misalignment between the oracle’s selection criterion and the evaluation metric, not that WiCER has found a better curation policy\. The sub\-linear, bounded shape of the trajectory is nonetheless consistent with theO​\(T​log⁡K\)O\(\\sqrt\{T\\log K\}\)bound of Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\. The clean empirical validation of the theorem—where the oracle genuinely dominates and WiCER tracks below it as predicted—comes from the confound\-free Wikipedia instantiation \(§[7\.2](https://arxiv.org/html/2606.09877#S7.SS2)and Table[14](https://arxiv.org/html/2606.09877#S7.T14)\)\.

### 7\.2Wikipedia Results

Table[13](https://arxiv.org/html/2606.09877#S7.T13)reports the same metrics as Table[11](https://arxiv.org/html/2606.09877#S7.T11)\. The Wikipedia results exhibit the*inverse*pattern from finance: No Wiki scores lowest \(3\.80\) while all wiki\-augmented methods score higher, with FIFO and Oracle reaching 4\.74\. This reversal directly validates our hypothesis about the finance confound: the backbone has strong parametric knowledge of S&P 500 companies \(entity\-level overlap\), but far weaker knowledge of specific Wikipedia edit events from 2024–2025 \(genuinely post\-training content\)\. Consequently, in the Wikipedia domain, richer context consistently produces higher QA scores—as one would expect from a system designed to supply information the LLM does not already know\.

A secondary finding is that FIFO matches Oracle on overall score \(both 4\.74±\\pm0\.60,N=232N=232\): for our selected 25 high\-edit\-velocity entities, recency alone captures the relevant content nearly as well as AER\-guided selection\. Online WiCER \(4\.57±\\pm0\.74,N=119N=119\) scores lower on aggregate because its selective pinning covers fewer QA pairs—but on the event\-driven subset it answers, it performs comparably \(4\.62 ev\)\. The regret analysis on 119 matched pairs confirms that AER scoring introduces positive regret \(\+16\.0\+16\.0total,\+0\.13\+0\.13/step\) relative to the oracle, reflecting the noisier signal compared to the finance domain’s volatility predictor\.

Table 13:Wikipedia instantiation: QA quality by strategy and query population\. Same protocol as Table[11](https://arxiv.org/html/2606.09877#S7.T11); materiality signal is AER \(§[6\.2](https://arxiv.org/html/2606.09877#S6.SS2)\)\. Unlike finance \(Table[11](https://arxiv.org/html/2606.09877#S7.T11)\), wiki\-augmented methods outperform No Wiki, confirming that Wikipedia edit content is genuinely post\-training for the backbone\.##### Cross\-instantiation comparison\.

The two instantiations share the identical algorithm and evaluation protocol; only the materiality signal and corpus differ\. Table[14](https://arxiv.org/html/2606.09877#S7.T14)compares regret convergence across both domains, providing multi\-domain empirical validation of the Streaming Knowledge Compilation framework\. The sign difference in cumulative regret is itself informative: finance shows negative regret \(−0\.12\-0\.12/step\) because the LLM\-judge confound inflates No Wiki \(the oracle baseline’s counterfactual\), while Wikipedia shows positive regret \(\+0\.13\+0\.13/step\) from a confound\-free evaluation where richer context genuinely helps\. The Wikipedia domain thus provides the cleaner regret signal; the finance domain provides the stronger materiality predictor \(AUROC 0\.728 vs\. noisier AER heuristic\)\.

Table 14:Cross\-instantiation regret comparison\. Finance regret is negative due to LLM\-judge confound \(entity\-level parametric overlap inflates oracle baseline\); Wikipedia regret is positive from confound\-free evaluation\. Both regret series converge sub\-linearly, confirming Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\.

## 8Discussion and Conclusion

##### Streaming Knowledge Compilation as a general problem\.

The central contribution of this work is the formalization of Streaming Knowledge Compilation—maintaining a budget\-bounded compiled wiki against a streaming corpus under query uncertainty—and the demonstration that it admits a general algorithmic solution with domain\-agnostic regret guarantees\. The two instantiations \(finance and Wikipedia\) share identical algorithm, theory, and evaluation protocol; only the materiality signalϕt\\phi\_\{t\}is domain\-specific\. This separability is the key architectural insight: an application domain expert specifiesϕt\\phi\_\{t\}\(abnormal volatility, edit velocity, clinical urgency, citation rate\), and the Online WiCER algorithm handles the rest\.

##### Reactive to proactive CEGAR\.

Classical CEGAR and batch WiCER discover knowledge gaps only when a query fails; Online WiCER*predicts*gaps from the stream before queries arrive\. This requires a fundamental modelling substitution: since queries are not observed at pinning time, the materiality signalϕt\\phi\_\{t\}acts as a*query relevance surrogate*\. The cost of this substitution is explicit in the regret decomposition \(Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\): theO​\(Bmiss⋅T⋅ε\)O\(B\_\{\\text\{miss\}\}\\cdot T\\cdot\\varepsilon\)prediction error term captures exactly the regret incurred when the surrogate misfires\. The substitution succeeds whenever the domain’s high\-materiality events are the events users subsequently query about—a reasonable assumption in finance, Wikipedia, clinical settings, and legal research alike\.

##### Graceful degradation\.

Our regret bound \(Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)\) decomposes cleanly into eviction and prediction components\. Even with imperfect volatility prediction \(ε\>0\\varepsilon\>0\), Online WiCER degrades gracefully: the eviction component still achieves sublinear regret through the multiplicative\-weights mechanism\. In the limit of random predictions \(ε→0\.5\\varepsilon\\to 0\.5\), Online WiCER reduces to a weighted FIFO baseline, which still outperforms an unmanaged wiki\.

##### Unified architecture benefits\.

Using a single frozen Llama 3\.1 8B backbone for all scoring paths yields practical advantages: \(i\) a single model to deploy and maintain, simplifying the inference stack on both edge devices \(Apple M4\) and cloud accelerators \(AWS Inferentia\), \(ii\) all heads—the classification head, the hybrid zero\-shot path, and the marginal regret regression head—reuse the same cached hidden states from a single backbone forward pass per article, and \(iii\) the zero\-shot path is invoked only for ambiguous cases \(shead∈\[0\.3,0\.7\]s\_\{\\text\{head\}\}\\in\[0\.3,0\.7\]\), while clear\-cut articles are scored by the head alone, substantially reducing autoregressive decoding costs \(see Table[10](https://arxiv.org/html/2606.09877#S7.T10)\)\.

##### From stateless to state\-aware pinning\.

The shift from the stateless classification head to the state\-aware marginal regret scorer represents a fundamental change in the decision criterion: rather than asking “is this article important?” in isolation, the system asks “does this article add information that is not already captured by the current pin set?” This distinction matters at pin budget boundaries: when two similarly important articles arrive simultaneously \(e\.g\., two earnings\-surprise stories for correlated stocks\), the stateless scorer cannot distinguish them and may pin both at the cost of evicting lower\-scored but complementary information\. The marginal scorer’sρ\\rho\-based novelty term directly penalises this redundancy\. The frozen\-backbone design ensures this richer reasoning costs nothing in backbone compute—only an additional dot product per candidate per greedy step—and its\(1−1/e\)\(1\-1/e\)approximation guarantee \(Proposition[5\.4](https://arxiv.org/html/2606.09877#S5.Thmtheorem4)\) provides a principled lower bound on pin\-set quality\.

##### Limitations\.

\(1\) Materiality prediction is inherently imperfect; rare “black swan” events may be missed regardless of domain\. \(2\) In the finance instantiation, cross\-entity volatility propagation \(e\.g\., a supplier’s disruption affecting a manufacturer\) is modeled only through sector\-level wikis, not direct supply\-chain links\. \(3\) In both instantiations, the cross\-sectional normalization of the materiality signal may be distorted during correlated crises when all entities exhibit elevated activity simultaneously\. \(4\) The LLM\-as\-judge evaluation is confounded on post\-training knowledge in both instantiations; the regret analysis on matched pairs is the reliable metric but requires running both Online WiCER and Oracle over the same QA pairs\. \(5\) In the Wikipedia instantiation, FIFO matches Oracle on aggregate score \(both 4\.74\), limiting the visible benefit of AER\-guided selection; this reflects the entity selection bias \(25 pre\-chosen high\-edit\-velocity articles\) rather than a fundamental limitation of AER as a signal\.

##### Future work\.

Extensions include: \(1\) additional domain instantiations \(clinical literature, legal filings, patent streams\) to broaden the multi\-domain validation; \(2\) a domain\-naïve judge \(a different model family with no parametric overlap\) to fully eliminate the LLM\-judge confound; \(3\) adaptive recompile scheduling triggered by cumulative pin volume rather than fixed intervals; \(4\) multi\-entity interaction modeling via a knowledge graph over the wiki hierarchy; and \(5\) online learning of the scorerϕ^t\\hat\{\\phi\}\_\{t\}to drive the prediction\-error termε→0\\varepsilon\\to 0as the system accumulates domain signal\.

##### Conclusion\.

We have formalized*Streaming Knowledge Compilation*—the problem of maintaining a budget\-bounded compiled wiki against a continuous document stream under query uncertainty—and introduced Online WiCER as its algorithmic solution\. The key insight is that a domain\-specific materiality signalϕt​\(k,n\)\\phi\_\{t\}\(k,n\), used as a proxy for query relevance, enables proactive pinning before any query arrives, reducing knowledge latency from reactive to predictive\. The regret boundO​\(T​log⁡K\)O\(\\sqrt\{T\\log K\}\)holds for any boundedϕt\\phi\_\{t\}; the prediction\-error termε=𝔼​\[\|ϕt−ϕ^t\|\]\\varepsilon=\\mathbb\{E\}\[\|\\phi\_\{t\}\-\\hat\{\\phi\}\_\{t\}\|\]is the only domain\-specific quantity\. We validate this claim empirically in two domains: in*finance*, a frozen Llama 3\.1 8B classification head predicts abnormal stock volatility \(AUROC = 0\.728,1\.49×1\.49\\timesrealized volatility ratio\), and cumulative regret over 173 matched pairs converges to−20\.0\-20\.0\(mean−0\.12\-0\.12/step\); in*Wikipedia*, the Abnormal Edit Ratio \(AER\) serves asϕt\\phi\_\{t\}, applying the same algorithm to a non\-financial public corpus with no algorithm modification; cumulative regret over 119 matched pairs is\+16\.0\+16\.0\(mean\+0\.13\+0\.13/step\), with positive sign reflecting the confound\-free evaluation where richer context consistently helps \(No Wiki 3\.80 vs\. Oracle 4\.74\)\. A methodological finding of independent interest emerges: LLM\-as\-judge evaluation is confounded on post\-training facts—the judge rates surface fluency rather than factual accuracy for knowledge outside its parametric memory—making regret analysis on matched pairs the reliable metric for compiled knowledge systems\. The state\-aware marginal regret scorer extends independent article scoring to set\-level selection via submodular maximization with a greedy\(1−1/e\)\(1\-1/e\)approximation guarantee, at negligible latency \(63×63\\timesthroughput headroom over peak stream volume on commodity hardware\)\. Streaming Knowledge Compilation is a broadly applicable problem: wherever a knowledge system must be kept current against a high\-velocity stream—financial news, clinical literature, legal filings, encyclopedic edits—the Online WiCER framework provides a principled, theoretically grounded, domain\-adaptable solution\.

## References

- Understanding intermediate layers using linear classifier probes\.International Conference on Learning Representations Workshop\.Note:arXiv preprint arXiv:1610\.01644Cited by:[§4\.2](https://arxiv.org/html/2606.09877#S4.SS2.SSS0.Px3.p1.1)\.
- T\. G\. Andersen and T\. Bollerslev \(1998\)Answering the skeptics: Yes, standard volatility models do provide accurate forecasts\.International Economic Review39\(4\),pp\. 885–905\.External Links:[Document](https://dx.doi.org/10.2307/2527343)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Araci \(2019\)FinBERT: financial sentiment analysis with pre\-trained language models\.arXiv preprint arXiv:1908\.10063\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Atkins, M\. Niranjan, and E\. H\. Gerding \(2018\)Financial news predicts stock market volatility better than close price\.The Journal of Finance and Data Science4\(2\),pp\. 120–137\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- S\. R\. Baker, N\. Bloom, and S\. J\. Davis \(2016\)Measuring economic policy uncertainty\.The Quarterly Journal of Economics131\(4\),pp\. 1593–1636\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- N\. Bansal, N\. Buchbinder, and J\. Naor \(2012\)Weighted paging with predictions\.InProceedings of the ACM\-SIAM Symposium on Discrete Algorithms,Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px6.p1.1),[§4\.4](https://arxiv.org/html/2606.09877#S4.SS4.p2.1)\.
- O\. E\. Barndorff\-Nielsen and N\. Shephard \(2002\)Econometric analysis of realized volatility and its use in estimating stochastic volatility models\.Journal of the Royal Statistical Society: Series B64\(2\),pp\. 253–280\.External Links:[Document](https://dx.doi.org/10.1111/1467-9868.00336)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- F\. Black and M\. Scholes \(1973\)The pricing of options and corporate liabilities\.Journal of Political Economy81\(3\),pp\. 637–654\.External Links:[Document](https://dx.doi.org/10.1086/260062)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- S\. T\. Bodilsen and A\. Lunde \(2025\)Exploiting news analytics for volatility forecasting\.Journal of Applied Econometrics40\(1\),pp\. 18–36\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- T\. Bollerslev \(1986\)Generalized autoregressive conditional heteroskedasticity\.Journal of Econometrics31\(3\),pp\. 307–327\.External Links:[Document](https://dx.doi.org/10.1016/0304-4076%2886%2990063-1)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Boudoukh, R\. Feldman, S\. Kogan, and M\. P\. Richardson \(2019\)Information, trading, and volatility: Evidence from firm\-specific news\.Review of Financial Studies32\(3\),pp\. 992–1033\.External Links:[Document](https://dx.doi.org/10.1093/rfs/hhy114)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- L\. Bybee, B\. T\. Kelly, A\. Manela, and D\. Xiu \(2024\)Business news and business cycles\.The Journal of Finance79\(5\),pp\. 3105–3147\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Caldara and M\. Iacoviello \(2022\)Measuring geopolitical risk\.American Economic Review112\(4\),pp\. 1194–1225\.External Links:[Document](https://dx.doi.org/10.1257/aer.20191823)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px5.p1.1)\.
- Y\. Cao, Z\. Chen, P\. Kumar, Q\. Pei, Y\. Yu, H\. Li, F\. Dimino, L\. Ausiello, K\.P\. Subbalakshmi, and P\. M\. Ndiaye \(2024\)RiskLabs: predicting financial risk using large language model based on multimodal and multi\-sources data\.arXiv preprint arXiv:2404\.07452\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px5.p1.1)\.
- N\. Cesa\-Bianchi and G\. Lugosi \(2006\)Prediction, learning, and games\.Cambridge University Press\.Cited by:[§A\.1](https://arxiv.org/html/2606.09877#A1.SS1.3.p3.1),[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px6.p1.1),[§5](https://arxiv.org/html/2606.09877#S5.1.p1.4)\.
- B\. J\. Chan, C\. Chen, J\. Cheng, and H\. Huang \(2025\)Don’t do RAG: when cache\-augmented generation is all you need for knowledge tasks\.InCompanion Proceedings of the ACM on Web Conference 2025,External Links:[Document](https://dx.doi.org/10.1145/3701716.3715490)Cited by:[§1](https://arxiv.org/html/2606.09877#S1.p1.1),[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Chen, B\. T\. Kelly, and D\. Xiu \(2022\)Expected returns and large language models\.SSRN Working Paper 4416687\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- J\. C\. Choi, J\. Kwon, A\. Lopez\-Lira, C\. Kim,et al\.\(2025\)FinAgentBench: a benchmark dataset for agentic retrieval in financial question answering\.InNeurIPS Workshop on Generative AI in Finance,Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px8.p1.1)\.
- K\. Christensen, M\. Siggaard, and B\. Veliyev \(2023\)A machine learning approach to volatility forecasting\.Journal of Financial Econometrics21\(5\),pp\. 1680–1727\.External Links:[Document](https://dx.doi.org/10.1093/jjfinec/nbac032)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- E\. Clarke, O\. Grumberg, S\. Jha, Y\. Lu, and H\. Veith \(2000\)Counterexample\-guided abstraction refinement\.InInternational Conference on Computer Aided Verification,pp\. 154–169\.Cited by:[§1](https://arxiv.org/html/2606.09877#S1.p1.1)\.
- F\. Corsi \(2009\)A simple approximate long\-memory model of realized volatility\.Journal of Financial Econometrics7\(2\),pp\. 174–196\.External Links:[Document](https://dx.doi.org/10.1093/jjfinec/nbp001)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Tuptanova \(2019\)BERT: pre\-training of deep bidirectional transformers for language understanding\.Proceedings of NAACL\-HLT,pp\. 4171–4186\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Edge, H\. Trinh, N\. Cheng, J\. Bradley, A\. Chao, A\. Mody, S\. Truitt, and J\. Larson \(2024\)From local to global: a graph RAG approach to query\-focused summarization\.arXiv preprint arXiv:2404\.16130\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px1.p1.1)\.
- R\. F\. Engle \(1982\)Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation\.Econometrica50\(4\),pp\. 987–1007\.External Links:[Document](https://dx.doi.org/10.2307/1912773)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Fiat, R\. M\. Karp, M\. Luby, L\. A\. McGeoch, D\. D\. Sleator, and N\. E\. Young \(1991\)Competitive paging algorithms\.Journal of Algorithms12\(4\),pp\. 685–699\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px6.p1.1),[§4\.4](https://arxiv.org/html/2606.09877#S4.SS4.p2.1)\.
- P\. Glasserman and H\. Mamaysky \(2019\)Does unusual news forecast market stress?\.Journal of Financial and Quantitative Analysis54\(4\),pp\. 1937–1974\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Grewal, C\. Hauptmann, and G\. Serafeim \(2021\)Material sustainability information and stock price informativeness\.Journal of Business Ethics171,pp\. 513–544\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px5.p1.1)\.
- T\. Guo and E\. Hauptmann \(2024\)Fine\-tuning large language models for stock return prediction using newsflow\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track,pp\. 1052–1065\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.emnlp-industry.77)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1),[§7\.1\.3](https://arxiv.org/html/2606.09877#S7.SS1.SSS3.p6.2)\.
- T\. A\. Hassan, S\. Hollander, L\. van Lent, and A\. Tahoun \(2019\)Firm\-level political risk: measurement and effects\.Quarterly Journal of Economics134\(4\),pp\. 2135–2202\.External Links:[Document](https://dx.doi.org/10.1093/qje/qjz021)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px5.p1.1)\.
- E\. Hazan \(2016\)Introduction to online convex optimization\.Foundations and Trends in Optimization2\(3–4\),pp\. 157–325\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px6.p1.1)\.
- S\. He, L\. Lv, A\. Manela, and J\. Wu \(2025\)Chronologically consistent large language models\.arXiv preprint arXiv:2502\.21206\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- T\. A\. Henzinger, R\. Jhala, R\. Majumdar, and G\. Sutre \(2002\)Lazy abstraction\.InProceedings of the 29th ACM SIGPLAN\-SIGACT Symposium on Principles of Programming Languages,pp\. 58–70\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px7.p1.1)\.
- S\. L\. Heston \(1993\)A closed\-form solution for options with stochastic volatility with applications to bond and currency options\.Review of Financial Studies6\(2\),pp\. 327–343\.External Links:[Document](https://dx.doi.org/10.1093/rfs/6.2.327)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px3.p1.1)\.
- J\. M\. Huerta \(2026\)WiCER: wiki\-memory compile, evaluate, refine—iterative knowledge compilation for llm wiki systems\.arXiv preprint arXiv:2605\.07068\.Cited by:[§A\.2](https://arxiv.org/html/2606.09877#A1.SS2.3.p3.1),[§1](https://arxiv.org/html/2606.09877#S1.p1.1),[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px1.p1.1),[§4\.5](https://arxiv.org/html/2606.09877#S4.SS5.SSS0.Px2.p1.1),[§4\.5](https://arxiv.org/html/2606.09877#S4.SS5.SSS0.Px3.p1.4)\.
- J\. Kasai, K\. Kasai, Y\. Bisk, H\. Hajishirzi, and N\. A\. Smith \(2023\)Realtime QA: what’s the answer right now?\.Advances in Neural Information Processing Systems\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px8.p1.1)\.
- M\. Khan, G\. Serafeim, and A\. Yoon \(2016\)Corporate sustainability: first evidence on materiality\.The Accounting Review91\(6\),pp\. 1697–1724\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px5.p1.1)\.
- P\. Lewis, E\. Perez, A\. Piktus, F\. Petroni, V\. Karpukhin, N\. Goyal, H\. Küttler, M\. Lewis, W\. Yih, T\. Rocktäschel,et al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive NLP tasks\.Advances in Neural Information Processing Systems33,pp\. 9459–9474\.Cited by:[§1](https://arxiv.org/html/2606.09877#S1.p1.1),[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Li, Y\. Sun, Y\. Lin, X\. Gao, S\. Shang, and R\. Yan \(2024\)CausalStock: deep end\-to\-end causal discovery for news\-driven multi\-stock movement prediction\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Liska, T\. Kocisky, E\. Gribovskaya, T\. Terber, E\. Seo, E\. Sezener, A\. Glaese, B\. Bohnet, M\. Auli, and T\. B\. Hashimoto \(2022\)StreamingQA: a benchmark for adaptation to new knowledge over time in question answering models\.International Conference on Machine Learning\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px8.p1.1)\.
- A\. Lopez\-Lira and Y\. Tang \(2023\)Can ChatGPT forecast stock price movements? return predictability and large language models\.arXiv preprint arXiv:2304\.07619\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Manela and A\. Moreira \(2017\)News implied volatility and disaster concerns\.Journal of Financial Economics123\(1\),pp\. 137–162\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- K\. L\. McMillan \(2006\)Lazy abstraction with interpolants\.International Conference on Computer Aided Verification,pp\. 123–136\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px7.p1.1)\.
- G\. L\. Nemhauser, L\. A\. Wolsey, and M\. L\. Fisher \(1978\)An analysis of approximations for maximizing submodular set functions\.Mathematical Programming14\(1\),pp\. 265–294\.Cited by:[§5](https://arxiv.org/html/2606.09877#S5.3.p1.8)\.
- P\. Sarthi, S\. Abdullah, A\. Tuli, S\. Khanna, A\. Goldie, and C\. D\. Manning \(2024\)RAPTOR: recursive abstractive processing for tree\-organized retrieval\.arXiv preprint arXiv:2401\.18059\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Tan, M\. A\. Merrill, V\. Gupta, T\. Althoff, and T\. Hartvigsen \(2024\)Are language models actually useful for time series forecasting?\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- P\. Wang, Z\. Li, N\. Zhang, Z\. Xu, Y\. Yao, Y\. Jiang, P\. Xie, F\. Huang, and H\. Chen \(2024a\)WISE: rethinking the knowledge memory for lifelong model editing of large language models\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px6.p1.1)\.
- X\. Wang, M\. Feng, J\. Qiu, J\. Gu, and J\. Zhao \(2024b\)From news to forecast: integrating event analysis in LLM\-based time series forecasting with reflection\.InAdvances in Neural Information Processing Systems,Vol\.37\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Wu, O\. Irsoy, S\. Lu, V\. Dabravolski, M\. Dredze, S\. Gehrmann, P\. Kambadur, D\. Rosenberg, and G\. Mann \(2023\)BloombergGPT: a large language model for finance\.arXiv preprint arXiv:2303\.17564\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Xie, W\. Han, Y\. Lai, M\. Peng, and J\. Huang \(2023\)The Wall Street Neophyte: a zero\-shot analysis of ChatGPT over multimodal stock movement prediction challenges\.arXiv preprint arXiv:2304\.05351\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- F\. Z\. Xing, E\. Cambria, and Y\. Zhang \(2019\)Sentiment\-aware volatility forecasting\.Knowledge\-Based Systems176,pp\. 68–76\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- H\. Yang, X\. Liu, and C\. D\. Wang \(2023\)FinGPT: open\-source financial large language models\.arXiv preprint arXiv:2306\.06031\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Zhang, H\. Yang, and X\. Liu \(2023\)InstructFinGPT: financial sentiment analysis by instruction tuning of general\-purpose large language models\.arXiv preprint arXiv:2306\.12659\.Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Zhang, Y\. Zhang, M\. Cucuringu, and Z\. Qian \(2024\)Volatility forecasting with machine learning and intraday commonality\.Journal of Financial Econometrics22\(2\),pp\. 492–530\.External Links:[Document](https://dx.doi.org/10.1093/jjfinec/nbad005)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Zhao, D\. Ding, C\. Luo, and Y\. Wan \(2025\)Using large language models to estimate novel risk: impact on volatility\.Journal of Portfolio Management51\(7\),pp\. 230–248\.External Links:[Document](https://dx.doi.org/10.3905/jpm.2025.51.7.230)Cited by:[§2](https://arxiv.org/html/2606.09877#S2.SS0.SSS0.Px5.p1.1)\.

## Appendix AProofs

### A\.1Proof of Theorem[5\.1](https://arxiv.org/html/2606.09877#S5.Thmtheorem1)

###### Proof\.

We decompose the regret into two terms corresponding to \(a\) the loss from evicting high\-volatility pins due to budget constraints, and \(b\) the loss from mispredicting volatility\.

Eviction regret\.Consider the pin selection problem as an online learning problem withNNexperts, where each expert corresponds to a potential pin\. At each time steptt, the algorithm selects a subset𝒫t⊂\{1,…,N\}\\mathcal\{P\}\_\{t\}\\subset\\\{1,\\ldots,N\\\}with total token cost∑p∈𝒫t\|fp\|≤Bpin\\sum\_\{p\\in\\mathcal\{P\}\_\{t\}\}\|f\_\{p\}\|\\leq B\_\{\\text\{pin\}\}\. The loss from not including a material pinp∗p^\{\*\}in𝒫t\\mathcal\{P\}\_\{t\}is at mostBmissB\_\{\\text\{miss\}\}\. We apply the multiplicative weights bound to this subset selection problem by noting that the optimal fixed pin setS∗=arg⁡minS:\|S\|≤Bpin​∑tℓt​\(S\)S^\{\*\}=\\arg\\min\_\{S:\|S\|\\leq B\_\{\\text\{pin\}\}\}\\sum\_\{t\}\\ell\_\{t\}\(S\)is itself a valid single comparator; the standard single\-expert regret bound therefore applies withlog⁡K\\log Kexperts, sinceNpins≤K⋅\|𝒩t\|maxN\_\{\\text\{pins\}\}\\leq K\\cdot\|\\mathcal\{N\}\_\{t\}\|\_\{\\max\}and at mostKKentities are relevant per step\.

Using the multiplicative weights framework\[Cesa\-Bianchi and Lugosi,[2006](https://arxiv.org/html/2606.09877#bib.bib42)\], with learning rateη=log⁡K/T\\eta=\\sqrt\{\\log K/T\}:

∑t=1Tℓt​\(𝒫t\)−minS:\|S\|≤Bpin​∑t=1Tℓt​\(S\)≤2​T​log⁡K⋅Bmiss\\sum\_\{t=1\}^\{T\}\\ell\_\{t\}\(\\mathcal\{P\}\_\{t\}\)\-\\min\_\{S:\|S\|\\leq B\_\{\\text\{pin\}\}\}\\sum\_\{t=1\}^\{T\}\\ell\_\{t\}\(S\)\\leq 2\\sqrt\{T\\log K\}\\cdot B\_\{\\text\{miss\}\}\(15\)whereℓt​\(𝒫t\)\\ell\_\{t\}\(\\mathcal\{P\}\_\{t\}\)is the quality loss from the pin set𝒫t\\mathcal\{P\}\_\{t\}at timett\.

Prediction error\.At each time step, the volatility scorer may \(i\) miss a material fact \(false negative, probabilityεFN\\varepsilon\_\{\\text\{FN\}\}\) or \(ii\) pin a non\-material fact \(false positive, probabilityεFP\\varepsilon\_\{\\text\{FP\}\}\)\. A false negative causes at mostBmissB\_\{\\text\{miss\}\}quality loss\. A false positive wastes pin budget, potentially causing eviction of a material pin\.

The expected prediction\-error loss per step is:

𝔼​\[ℓtpred\]≤\|𝒩t\|⋅εFN⋅Bmiss\+\|𝒩t\|⋅εFP⋅BmissBpin\\mathbb\{E\}\[\\ell\_\{t\}^\{\\text\{pred\}\}\]\\leq\|\\mathcal\{N\}\_\{t\}\|\\cdot\\varepsilon\_\{\\text\{FN\}\}\\cdot B\_\{\\text\{miss\}\}\+\|\\mathcal\{N\}\_\{t\}\|\\cdot\\varepsilon\_\{\\text\{FP\}\}\\cdot\\frac\{B\_\{\\text\{miss\}\}\}\{B\_\{\\text\{pin\}\}\}\(16\)Summing overTTsteps and boundingεFN\+εFP≤2​ε\\varepsilon\_\{\\text\{FN\}\}\+\\varepsilon\_\{\\text\{FP\}\}\\leq 2\\varepsilon:

∑t=1T𝔼​\[ℓtpred\]≤O​\(Bmiss⋅T⋅ε\)\\sum\_\{t=1\}^\{T\}\\mathbb\{E\}\[\\ell\_\{t\}^\{\\text\{pred\}\}\]\\leq O\(B\_\{\\text\{miss\}\}\\cdot T\\cdot\\varepsilon\)\(17\)
Combining both terms yields the stated bound\. ∎

### A\.2Proof of Theorem[5\.2](https://arxiv.org/html/2606.09877#S5.Thmtheorem2)

###### Proof\.

Between recompilations \(timetttot\+Trt\+T\_\{r\}\), the wiki accumulates incremental pin patches\. The quality deviation from a hypothetical full recompilation at each step is bounded by the number of “stale” facts that would be reorganized in a full compile\.

At each step, at most\|𝒩t\|avg⋅ε\|\\mathcal\{N\}\_\{t\}\|\_\{\\text\{avg\}\}\\cdot\\varepsilonpins are misplaced \(material facts missed or non\-material facts pinned\)\. OverTrT\_\{r\}steps, the cumulative misplacement isTr⋅\|𝒩t\|avg⋅εT\_\{r\}\\cdot\|\\mathcal\{N\}\_\{t\}\|\_\{\\text\{avg\}\}\\cdot\\varepsilontokens\. Normalizing by the pin budget:

𝒬​\(Wt\+Trincr,q\)≥𝒬​\(Wtbatch,q\)−c⋅Tr⋅\|𝒩t\|avg⋅εBpin\\mathcal\{Q\}\(W\_\{t\+T\_\{r\}\}^\{\\text\{incr\}\},q\)\\geq\\mathcal\{Q\}\(W\_\{t\}^\{\\text\{batch\}\},q\)\-c\\cdot\\frac\{T\_\{r\}\\cdot\|\\mathcal\{N\}\_\{t\}\|\_\{\\text\{avg\}\}\\cdot\\varepsilon\}\{B\_\{\\text\{pin\}\}\}\(18\)for a constantccdepending on the maximum quality sensitivity to a single misplaced pin\.

The batch WiCER convergence guarantee\[Huerta,[2026](https://arxiv.org/html/2606.09877#bib.bib1)\]ensures that each full recompilation converges in at most 2 iterations of the Compile–Evaluate–Refine loop, restoring quality to the batch optimum\. ∎

### A\.3Proof of Proposition[5\.3](https://arxiv.org/html/2606.09877#S5.Thmtheorem3)

###### Proof\.

The time\-averaged regret is:

Regret​\(T\)T≤O​\(T​log⁡K\)T\+O​\(Bmiss⋅ε\)=O​\(log⁡KT\)\+O​\(ε\)\\frac\{\\text\{Regret\}\(T\)\}\{T\}\\leq\\frac\{O\(\\sqrt\{T\\log K\}\)\}\{T\}\+O\(B\_\{\\text\{miss\}\}\\cdot\\varepsilon\)=O\\\!\\left\(\\sqrt\{\\frac\{\\log K\}\{T\}\}\\right\)\+O\(\\varepsilon\)\(19\)The first term vanishes asT→∞T\\to\\infty\. The second term is bounded byε\\varepsilon, which can be driven to zero with improved volatility prediction \(e\.g\., more training data, online learning of the scorer\)\. Ifε=O​\(T−β\)\\varepsilon=O\(T^\{\-\\beta\}\)for someβ\>0\\beta\>0\(the scorer improves over time\), then both terms vanish\. ∎

## Appendix BExact Prompts for Reproducibility

### B\.1Zero\-Shot Volatility Scorer Prompt

The following prompt is used verbatim for the zero\-shot scoring path \(§[4\.2](https://arxiv.org/html/2606.09877#S4.SS2)\)\. The placeholders\{ticker\}and\{headline\}are substituted at inference time\. The prompt is delivered as a user message in the chat template; the system message instructs the model to respond with a single number\.

Listing 1:Zero\-shot volatility scoring prompt \(exact text\)\.System:

Youareaquantitativefinancialanalyst\.RespondwithONLYa

singledecimalnumberbetween0and1\.Noexplanation\.

User:

Onascaleof0to1,ratetheprobabilitythatthefollowing

newsheadlineabout\{ticker\}willcauseABNORMALstockprice

volatility\-\-definedasrealized5\-dayreturnvolatility

exceeding2timesthecurrentcross\-sectionalaverageacross

allS&P500stocks\.

Consider:earningssurprises,majorM&Aactivity,CEOchanges,

regulatory/legalactions,productrecalls,guidancerevisions,

activistinvestoractivity,andmacroeconomicshocksspecific

tothiscompany\.

Headline:\{headline\}

Respondwithasinglenumberbetween0and1\.

### B\.2Fact Extraction Prompt

The following prompt is used byExtractFactsto convert a raw headline into a pinnable fact sentence for the wiki\.

Listing 2:Fact extraction prompt \(exact text\)\.System:

Extractfactsconcisely\.Respondwithonesentenceonly\.

User:

Extractthesinglemostimportantfinancialfactfromthis

newsabout\{ticker\}:

\{headline\}

### B\.3Wiki Compilation Prompt

Used byBatchWiCERto compile a set of extracted facts into a coherent wiki section\.

Listing 3:Wiki compilation prompt \(exact text\)\.System:

Youareafinancialencyclopediaeditor\.Writeconcise,

factualwikientries\.

User:

Compilethefollowingnewsabout\{entity\}intoawikisection

coveringkeyevents,financialperformance,andoutlook\.

200\-300words,encyclopediastyle\.

Recentnews:

\{headlines\_list\}

Similar Articles

Online Pandora's Box for Contextual LLM Cascading

arXiv cs.AI

This paper introduces an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs, proposing a learning approach that combines GMM estimation with UCB-style confidence bounds and proving dimension-dependent regret bounds.

LLM Wiki v2 (16 minute read)

TLDR AI

This post presents a pattern for building personal knowledge bases using LLMs, offering a structured approach for leveraging large language models in knowledge management.

Whose Facts Win? LLM Source Preferences under Knowledge Conflicts

arXiv cs.CL

This paper investigates how LLMs handle knowledge conflicts in retrieval-augmented generation by studying their preferences for different information sources. The authors find that LLMs prefer institutionally-corroborated sources but these preferences can be reversed by repetition, proposing a method to reduce repetition bias while maintaining consistent source preferences.