LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction
Summary
LEDGER is a new benchmark for evaluating long-context capabilities of LLMs on corporate annual reports, providing 4,999 digitized reports with 31 financial KPIs and three evaluation tasks spanning retrieval and extraction.
# LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction
Source: [https://arxiv.org/html/2606.13100](https://arxiv.org/html/2606.13100)
###### Abstract\.
Financial reporting is a natural proving ground for large language models, and the very-long-context capabilities of recent models across all sizes make rigorous evaluation in this domain an increasingly pressing need. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question–answer items. We release Ledger (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate *annual reports*: full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs to be extracted and is linked to the market's reaction at the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 natural-language questions, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, the latter two operating on long, numerically dense reports. We additionally provide human OCR-quality annotations with inter-annotator agreement and the complete extraction, validation, and scoring toolchain. We further demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.
Finance, Long\-context evaluation, Information retrieval, Document understanding, Key performance indicators, Benchmark datasets
## 1. Introduction
Corporate financial reports provide critical data on the health and results of a company, but their length and complexity place a heavy cognitive burden on analysts. Despite these barriers, their release triggers swift, high-volatility market reactions, and the critical window to assess a report before the market opens and reacts is very short, generally about one hour. Consequently, automated tools capable of extracting actionable intelligence from dense financial text in near real time would benefit any financial investor. Large language models (LLMs) promise to read, understand, and extract such information, all the more so given the rapid growth of context windows, which now reach hundreds of thousands of tokens even for small, locally hostable models. Whether models can actually *ground* their answers in such documents is, however, poorly measured. Most public financial resources reduce the problem to short, plain-text excerpts of U.S. SEC 10-K filings paired with a few hundred question–answer pairs (Chen et al., [2021](https://arxiv.org/html/2606.13100#bib.bib5); Zhu et al., [2021](https://arxiv.org/html/2606.13100#bib.bib34); Islam et al., [2023](https://arxiv.org/html/2606.13100#bib.bib11)). Real financial analysis instead operates over the full *annual report*: a long, visually dense "glossy" document interleaving narrative, statutory statements, hundreds of tables, and figures, where the same quantity ("revenue") is reported under company-specific labels, scales, and scopes.
We release Ledger, a resource built to evaluate retrieval and extraction on exactly these documents. We start from 4,999 OCR'd annual reports and label them with ground-truth KPIs, natural-language questions, and the corresponding page-level relevance. The resource is organized as a difficulty spectrum, from locating a page, to extracting one value, to extracting an entire financial statement, and ships with the complete toolchain used to build and score it. Our contributions are:
- A corpus of labeled reports: reports of 738 companies over fiscal years 2009–2024, with 31 consolidated KPIs for each and market data aligned with their earnings date. Each of the 118,048 KPI labels comes with a natural-language question and page-level relevance labels (Section [3](https://arxiv.org/html/2606.13100#S3)).
- Three benchmarks with baselines: page-level KPI retrieval (TREC qrels), single-value conversational "needle-in-a-haystack" lookup, and full multi-KPI extraction, with sparse, learned-sparse and late-interaction IR baselines and four open-weight LLM baselines (Section [4](https://arxiv.org/html/2606.13100#S4)).
- A research-utility case study linking CEO-letter rhetoric to earnings-per-share surprise and post-publication market impact (Section [5](https://arxiv.org/html/2606.13100#S5)).
- Open release under the MIT license for the code and Creative Commons Attribution 4.0 for the data, with a permanent DOI and the full extraction/validation toolchain.
## 2. Related resources
Finance has been an early and active target for LLMs, both as a training domain (e.g. BloombergGPT (Wu et al., [2023](https://arxiv.org/html/2606.13100#bib.bib32)), trained on proprietary data) and as an evaluation domain. Numerical reasoning benchmarks such as FinQA (Chen et al., [2021](https://arxiv.org/html/2606.13100#bib.bib5)), ConvFinQA (Chen et al., [2022](https://arxiv.org/html/2606.13100#bib.bib6)) and TAT-QA (Zhu et al., [2021](https://arxiv.org/html/2606.13100#bib.bib34)) pair short table-plus-text snippets with arithmetic questions; FinanceBench (Islam et al., [2023](https://arxiv.org/html/2606.13100#bib.bib11)) broadens this to open-book QA over 10-K filings, and DocFinQA (Reddy et al., [2024](https://arxiv.org/html/2606.13100#bib.bib22)) extends the context to document scale. Retrieval-oriented work has shown that document structure matters for financial RAG (Jimeno Yepes et al., [2024](https://arxiv.org/html/2606.13100#bib.bib12)), and visual document retrieval benchmarks such as ViDoRe (Loison et al., [2026](https://arxiv.org/html/2606.13100#bib.bib17)) evaluate page retrieval over rendered PDFs. Ledger differs along four axes simultaneously: (i) it covers *full annual reports*, not just 10-K text, including the glossy front matter, figures, and CEO letters that 10-Ks omit; (ii) documents are genuinely long-context (≈126k tokens, median 124 pages); (iii) it provides *authoritative numeric labeling* at scale, with ≈30 KPIs per company-year reconciled against SEC XBRL filings rather than crowd answers; and (iv) it couples the documents to a market-reaction signal, enabling downstream financial-research tasks. The whole resource is openly licensed and accompanied by a datasheet (Gebru et al., [2021](https://arxiv.org/html/2606.13100#bib.bib8)) and FAIR-compliant (Wilkinson et al., [2016](https://arxiv.org/html/2606.13100#bib.bib31)) metadata.
## 3. The Ledger corpus
Figure 1. Overview of the Ledger corpus.
### 3.1. Financial Reports and OCR
The Ledger corpus is built from a set of 4,999 publicly available corporate annual reports in PDF format. Using DeepSeek-OCR-2 (Wei et al., [2025](https://arxiv.org/html/2606.13100#bib.bib29), [2026](https://arxiv.org/html/2606.13100#bib.bib30)), we converted each Ledger report into a single page-aligned Markdown file in which pages are delimited by an explicit `<--- Page Split --->` marker, tables are rendered as HTML/LaTeX, and per-page raster images are retained for downstream visual tasks.
Because every downstream task depends on OCR fidelity, we provide a human assessment of table-extraction quality. Fifteen annotators graded the alignment and contents of ≈1,150 extracted financial tables as `ok` / `not_ok` / `uncertain` through a purpose-built interface that renders each detected table next to its source page image. A 273-table subset was triple-coded to measure agreement: we observe 87.2% full agreement, 91.3% pairwise agreement, and a Fleiss' $\kappa$ of 0.81 (substantial agreement). At the table level, 81.5% of tables were graded as correctly aligned, giving a concrete, auditable estimate of OCR table fidelity that downstream users can condition on. Overall, number extraction is nearly perfect, with only rare misrepresentations; layout alignment, however, can be suboptimal for complex tables.
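To make the agreement statistic concrete, the following is a minimal sketch of Fleiss' kappa for a triple-coded subset. The function name, the input layout (one list of labels per table) and the toy ratings are assumptions for illustration, not the released annotation tooling.

```python
from collections import Counter

def fleiss_kappa(ratings, categories=("ok", "not_ok", "uncertain")):
    """Fleiss' kappa for items that were each graded by the same number of raters.

    ratings: list of per-item label lists, e.g. [["ok", "ok", "not_ok"], ...].
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number of ratings")
    counts = [Counter(item) for item in ratings]
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall category proportions.
    proportions = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_expected = sum(p ** 2 for p in proportions)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical toy ratings, three raters per table.
toy = [
    ["ok", "ok", "ok"],
    ["ok", "ok", "not_ok"],
    ["not_ok", "not_ok", "not_ok"],
    ["ok", "ok", "ok"],
]
kappa = fleiss_kappa(toy)
```

Kappa discounts the raw agreement rate by the agreement expected from the label distribution alone, which is why it is reported alongside the full and pairwise agreement percentages.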
Table 1. The Ledger dataset statistics. The Eval companies are a subset of the Total corpus. Tokens were counted using the `cl100k_base` tokenizer.
### 3.2. KPI labeling
We attach consolidated KPIs to every company-year report, organized across the three financial statements: *(i) income*: revenue, cost of revenue, gross profit, R&D, SG&A, operating income, interest expense, income-tax expense, net income, basic and diluted EPS; *(ii) balance sheet*: total assets, total liabilities, stockholders' equity and its incl.-NCI variant, cash & equivalents and its incl.-restricted variant, long-term debt at three scopes, short-term borrowings, inventory, receivables, payables, shares outstanding; and *(iii) cash flow*: operating, investing and financing flows, capex, depreciation & amortization, dividends paid. EBITDA is deliberately excluded: it is reporting-standard dependent and derivable from the released components. Values are routed through a three-tier source waterfall: SEC EDGAR `companyfacts` XBRL for U.S. listings, the Yahoo! Finance API (through the `yfinance` Python package) as a fallback for non-U.S. listings, and Alpha Vantage as an opt-in gap-fill that never overwrites the first two. The result is 118,048 audited and consolidated facts with per-KPI yearly coverage above 85% for the core line items, with a fiscal-year convention that handles 52- and 53-week fiscal years.
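The source waterfall can be sketched as a priority lookup. This is an illustrative sketch only: the source keys, the `allow_gap_fill` flag and the function name are hypothetical and do not mirror the released toolchain, and no vendor API is called here.

```python
from typing import Optional, Tuple

# Hypothetical source keys, ordered from most to least authoritative.
PRIMARY_SOURCES = ("sec_xbrl", "yfinance")
GAP_FILL_SOURCE = "alpha_vantage"

def resolve_kpi(candidates: dict, allow_gap_fill: bool = False) -> Optional[Tuple[float, str]]:
    """Pick one value for a (company, year, KPI) cell and record its provenance.

    candidates maps a source key to a value, or to None when that source has no data.
    The gap-fill source is consulted only when opted in and only when both
    primary sources are empty, so it can never overwrite them.
    """
    for source in PRIMARY_SOURCES:
        value = candidates.get(source)
        if value is not None:
            return value, source
    if allow_gap_fill and candidates.get(GAP_FILL_SOURCE) is not None:
        return candidates[GAP_FILL_SOURCE], GAP_FILL_SOURCE
    return None

cell = {"sec_xbrl": None, "yfinance": 10.0, "alpha_vantage": 11.0}
resolved = resolve_kpi(cell, allow_gap_fill=True)
```

Returning the source next to the value keeps provenance attached to every fact, which matches the per-value provenance the release describes.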
### 3.3. Natural language questions and relevant pages for KPI
For every (company, year, KPI) triplet, we generated a natural-language question to simulate usage in a chat interface. To do so, we first collected alternative company names from DBpedia (Auer et al., [2007](https://arxiv.org/html/2606.13100#bib.bib2)), in order to automatically reproduce the variability in how human users name a company, as done in (Malherbe and Aufaure, [2016](https://arxiv.org/html/2606.13100#bib.bib18)). Second, we generated several variants of the question for each KPI using Gemini 3.1 Pro (Google DeepMind, [2026](https://arxiv.org/html/2606.13100#bib.bib9)), in the form of templates where *company name* and *year* are to be replaced. In practice, since each KPI and each company occur several times in our set of triplets, we uniformly sampled one question template and one company name for every triplet.
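The sampling step can be illustrated with a short sketch. The `{company}` and `{year}` placeholder syntax, the example templates and the aliases below are assumptions made for illustration; the released templates may use a different format.

```python
import random

def make_question(templates, aliases, year, rng):
    """Draw one template and one company alias uniformly, then fill the slots."""
    template = rng.choice(templates)
    company = rng.choice(aliases)
    return template.format(company=company, year=year)

# Hypothetical templates for a single KPI (revenue) and hypothetical aliases.
revenue_templates = [
    "What was the total revenue of {company} in {year}?",
    "How much revenue did {company} report for fiscal year {year}?",
]
company_aliases = ["Acme Corporation", "Acme Corp.", "Acme"]

rng = random.Random(0)  # fixed seed so the draw is reproducible
question = make_question(revenue_templates, company_aliases, 2021, rng)
```

Sampling once per triplet keeps a single question per label while still spreading phrasing and naming variants across the corpus.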
To support retrieval-augmented generation (RAG) over the pages of the corresponding report, we annotated the relevant pages for each query. To do so, we took the corresponding ground-truth KPI value and searched for pages of the report containing this value, producing TREC-style relevance judgments (Voorhees et al., [2005](https://arxiv.org/html/2606.13100#bib.bib27)). From 118,048 questions we mine over a million candidate (query, page) pairs via unit-normalized value matching and grade each on a 0/1/2 scale (not relevant / contextual mention / primary source) with an LLM judge (in our case Qwen 3.6 (Qwen Team, [2026](https://arxiv.org/html/2606.13100#bib.bib21))), yielding a graded `qrels` file. A manual spot-check of 60 graded pairs by a domain expert yielded 91.6% agreement with the judge; a full multi-annotator study is planned for a future version of this work.
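Candidate mining by unit-normalized value matching could look like the sketch below. The regular expression, the set of scales, the tolerance and the helper names are all assumptions; the released toolchain may normalize units differently. The last helper writes one line in the standard four-column TREC qrels layout (query id, iteration, document id, relevance).

```python
import re

# Matches figures such as 1,234.5 or (57); parentheses denote accounting negatives.
NUMBER_RE = re.compile(r"\(?-?\d[\d,]*(?:\.\d+)?\)?")
# Reports state values in units, thousands, millions or billions.
SCALES = (1, 1_000, 1_000_000, 1_000_000_000)

def page_numbers(text):
    """Parse every numeric token on a page into a float."""
    values = []
    for token in NUMBER_RE.findall(text):
        negative = token.startswith("(") and token.endswith(")")
        number = float(token.strip("()").replace(",", ""))
        values.append(-number if negative else number)
    return values

def is_candidate(page_text, gold_value, rel_tol=5e-4):
    """True if some number on the page equals the gold value at some unit scale.

    Magnitudes are compared because sign conventions vary between statements.
    """
    target = abs(gold_value)
    for number in page_numbers(page_text):
        for scale in SCALES:
            if abs(abs(number) * scale - target) <= rel_tol * target:
                return True
    return False

def qrels_line(query_id, doc_id, grade):
    """One judgment in TREC qrels format: qid, iteration (0), docno, relevance."""
    return f"{query_id} 0 {doc_id} {grade}"

hit = is_candidate("Total revenue 1,234.0 (in millions)", 1_234_000_000.0)
line = qrels_line("q1", "report42_p12", 2)
```

Value matching alone over-generates (a page number or a year can coincide with a small KPI), which is why each candidate pair is graded afterwards by a judge.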
### 3.4. Market-reaction linkage
For each U.S. company-year we link the report to the market's reaction at publication. Using EDGAR's `submissions` feed we select the original 10-K (excluding amendments) whose filer-labelled FY matches, take the earliest `acceptanceDateTime` (UTC), and classify it as pre-market / intraday / after-hours against NYSE hours in Eastern time. We consider raw returns, defined as:
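The session classification can be sketched with the standard library. This is a sketch under stated assumptions: regular NYSE hours of 09:30 to 16:00 Eastern, a timezone-aware ISO 8601 input string, and no handling of weekends, holidays or early closes; the function name is hypothetical.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo

NYSE_TZ = ZoneInfo("America/New_York")
MARKET_OPEN = time(9, 30)   # assumed regular session open
MARKET_CLOSE = time(16, 0)  # assumed regular session close

def classify_acceptance(acceptance_utc: str) -> str:
    """Map a UTC acceptance timestamp to pre_market / intraday / after_hours."""
    # Accept a trailing "Z" as UTC on Python versions that do not parse it.
    stamp = datetime.fromisoformat(acceptance_utc.replace("Z", "+00:00"))
    if stamp.tzinfo is None:
        raise ValueError("timestamp must carry a timezone offset")
    local = stamp.astimezone(NYSE_TZ).time()
    if local < MARKET_OPEN:
        return "pre_market"
    if local < MARKET_CLOSE:
        return "intraday"
    return "after_hours"

session = classify_acceptance("2023-02-03T21:31:45Z")
```

Converting through a named timezone rather than a fixed offset keeps the classification correct across daylight-saving changes.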
$$r_h = \frac{P_t - P_{t+h}}{P_t}, \tag{1}$$
where $P_t$ is the closing price on the earnings date (or the prior close if published pre-market). Note that since we compare with a past date (i.e. $h<0$), our choice of sign means that $r_h>0$ expresses a rise in the stock price. We also compute the Earnings Per Share (EPS) Surprise as
$$\text{EPS Surprise} = \frac{\text{Reported EPS} - \text{Consensus EPS}}{|\text{Consensus EPS}|}, \tag{2}$$
where the absolute value in the denominator ensures that the sign of the surprise is preserved even when the consensus is negative, and the EPS is that of the fourth quarter of the fiscal year.
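Equations (1) and (2) translate directly into code. The function names and the sample prices below are illustrative assumptions; only the two formulas come from the text.

```python
def raw_return(p_t: float, p_t_plus_h: float) -> float:
    """Equation (1): (P_t - P_{t+h}) / P_t, with t+h in the past when h < 0."""
    return (p_t - p_t_plus_h) / p_t

def eps_surprise(reported: float, consensus: float) -> float:
    """Equation (2): signed surprise relative to the magnitude of the consensus."""
    if consensus == 0:
        raise ValueError("surprise is undefined for a zero consensus")
    return (reported - consensus) / abs(consensus)

# Price rose from 100 (90 days earlier) to 110 at the earnings date: r > 0.
r_90d = raw_return(p_t=110.0, p_t_plus_h=100.0)
# A smaller loss than expected is a positive surprise thanks to the absolute value.
surprise = eps_surprise(reported=-0.8, consensus=-1.0)
```

Without the absolute value, the second example would divide a positive numerator by a negative consensus and report a beat as a miss.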
## 4. Benchmarks and baselines
The resource induces three tasks of increasing difficulty over the same documents and ground-truth labels: find the page given the question, extract one KPI given the question, extract every KPI. For the two extraction tasks (Sections [4.2](https://arxiv.org/html/2606.13100#S4.SS2)–[4.3](https://arxiv.org/html/2606.13100#S4.SS3)) we report *recall* = correct values / gold values (a non-answer counts as a miss) and *precision* = correct / attempted (non-answers and unverifiable extras excluded). For the single-KPI extraction, recall coincides with the exact-match *accuracy* reported in the long-context literature (Hsieh et al., [2024](https://arxiv.org/html/2606.13100#bib.bib10); Liu et al., [2024](https://arxiv.org/html/2606.13100#bib.bib15)). To fit our 4-GPU setup in a reasonable time, we run our experiments over 494 labeled reports of our corpus, with 13,519 KPI ground-truth labels in total (≈27.4 KPIs per report). For the first two tasks, we kept 10,000 questions for which there is at least one relevant page.
### 4.1. Page-level retrieval
With a RAG scenario in mind, we first evaluated how well a retriever using the question as a query finds the relevant page within the relevant report: $(\textit{question, report}) \to \textit{relevant page}$. Table [2](https://arxiv.org/html/2606.13100#S4.T2) reports this page-level retrieval protocol, comparing lexical BM25 (Robertson and Zaragoza, [2009](https://arxiv.org/html/2606.13100#bib.bib24)) and the learned-sparse SPLADE (Formal et al., [2021](https://arxiv.org/html/2606.13100#bib.bib7)) against the multi-representation dense retriever ColBERT (Khattab and Zaharia, [2020](https://arxiv.org/html/2606.13100#bib.bib13)). ColBERT consistently outperforms both BM25 and SPLADE across all metrics. This is expected because ColBERT uses token-level late interaction to compute relevance scores from fine-grained token-to-token alignments, rather than compressing the entire document into a single vector. Locating the right page nevertheless remains difficult: ColBERT achieves an MRR of only 0.475 (Recall@5: 0.370), while SPLADE and BM25 drop to 0.386 (Recall@5: 0.272) and 0.324 (Recall@5: 0.265), respectively, suggesting that numerically dense pages defeat off-the-shelf retrievers. Note that to run SPLADE and ColBERT, we use 4 GPUs to parallelize report-level evaluation (one model instance per GPU, with dynamic work-stealing dispatch) and an encoding batch size of 32 for both document pages and queries within each report.
Table 2. Per-document page retrieval, averaged over 10,000 questions (494 reports, Test corpus). Recall@$k$ and MRR use binary relevance ($\mathrm{rel} \geq 1$); nDCG uses graded gains (0/1/2). Best per row in bold.
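For readers reproducing the binary-relevance metrics, a minimal per-query sketch follows. It assumes the common definitions (Recall@k as the fraction of relevant pages found in the top k, reciprocal rank of the first relevant page) and a hypothetical ranking; the released scoring code or `trec_eval` remains the reference.

```python
def recall_at_k(ranked_pages, qrels, k):
    """Fraction of relevant pages (grade >= 1) found in the top-k ranking."""
    relevant = {page for page, grade in qrels.items() if grade >= 1}
    if not relevant:
        raise ValueError("query has no relevant page")
    hits = sum(1 for page in ranked_pages[:k] if page in relevant)
    return hits / len(relevant)

def reciprocal_rank(ranked_pages, qrels):
    """1 / rank of the first relevant page, or 0.0 if none is retrieved."""
    for rank, page in enumerate(ranked_pages, start=1):
        if qrels.get(page, 0) >= 1:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking of page numbers and graded judgments for one question.
ranking = [7, 3, 12, 5]
judgments = {3: 2, 5: 1, 40: 1}
r_at_3 = recall_at_k(ranking, judgments, k=3)
rr = reciprocal_rank(ranking, judgments)
```

Averaging `reciprocal_rank` over all questions gives MRR; the graded 0/1/2 labels only matter for nDCG.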
### 4.2. Conversational long-context extraction
With a conversational chatbot scenario in mind, we evaluated long-context information extraction given a question and its report: $(\textit{question, report}) \to \textit{KPI}$. The model receives an entire OCR'd report (≈100k tokens) and extracts a single specified KPI as a structured JSON object (`found`, `value`, `unit_scale`, `page`). A response is *matched* if it falls within $\pm 0.05\%$ of the ground-truth label. Because the long document prefix is constant per report and only the question varies, prefix caching cuts prefill by ≈21×, making full-corpus evaluation tractable on a single GPU server. The left block of Table [3](https://arxiv.org/html/2606.13100#S4.T3) reports four open-weight models: the strongest (Qwen3.6-27B) reaches 91.4% recall at 93.5% precision, while a model with a systematic unit-scaling error (Nemotron) collapses to 15.0%.
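The matching rule and the two reported metrics can be sketched as follows. The sketch assumes that `unit_scale` is a numeric multiplier applied to `value`, which the text does not specify; the function names and sample responses are hypothetical.

```python
def is_match(predicted: float, gold: float, rel_tol: float = 0.0005) -> bool:
    """True if the prediction lies within ±0.05% of the gold value."""
    return abs(predicted - gold) <= rel_tol * abs(gold)

def score(responses, gold_values):
    """Return (recall, precision) over aligned responses and gold values.

    A missing or found=False response counts as a miss for recall and is
    excluded from the precision denominator.
    """
    correct = attempted = 0
    for response, gold in zip(responses, gold_values):
        if response is None or not response.get("found"):
            continue
        attempted += 1
        # Assumption: unit_scale is a numeric multiplier (e.g. 1_000_000).
        value = response["value"] * response.get("unit_scale", 1)
        if is_match(value, gold):
            correct += 1
    recall = correct / len(gold_values)
    precision = correct / attempted if attempted else 0.0
    return recall, precision

responses = [
    {"found": True, "value": 1234.0, "unit_scale": 1_000_000, "page": 57},
    {"found": False},
    {"found": True, "value": 50.0, "unit_scale": 1, "page": 12},
    {"found": True, "value": 99.96, "unit_scale": 1, "page": 80},
]
gold = [1_234_000_000.0, 10.0, 55.0, 100.0]
recall, precision = score(responses, gold)
```

A model that reads the right cell but reports the wrong scale fails `is_match` by orders of magnitude, which is how a systematic unit-scaling error translates into very low recall.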
### 4.3. Comprehensive long-context extraction
The hardest task asks a model to extract all 31 KPIs from a report in a single pass: $\textit{report} \to \textit{all KPIs}$. We evaluated this task on ground-truth labels by recall (coverage of true facts) and precision (correctness of emitted facts). The right block of Table [3](https://arxiv.org/html/2606.13100#S4.T3) shows that single-value extraction capability does *not* transfer: Ministral, second-best at the needle task (87.9%), collapses to 41.4% recall under structured extraction, emitting many hallucinated cells; conversely Nemotron, near-useless at the needle task, recovers to 65.5% recall once schema constraints suppress its scaling error. Qwen3.6-27B is the only model strong on both, and no model exceeds 80% recall, establishing the task as an open challenge.
Table 3. LLM baselines for long-context extraction on Test corpus, all as recall (R) and precision (P) with a match tolerance of $\pm 0.05\%$. *Conversational*: 10,000 questions. *Comprehensive*: 494 reports, 13,519 cells. Rows ordered by recall; best per column in bold.
## 5. Case study: Market Sentiment Prediction
We investigated the value of our corpus for a more challenging research question: whether the *rhetoric* in the reports carries signal about future fundamentals and market reaction. To do so, we considered six highly liquid industries: specialty chemicals, auto parts, packaged foods, oil & gas exploration & production, oil & gas equipment & services, and mortgage REITs, over fiscal years 2017–2022. On the non-10-K reports of these sectors, we evaluate the signal carried by the CEO letter alone.
We compared several backbone encoders, each applied to the full token sequence with frozen weights; a linear layer with $L_2$ regularization on its weights is trained atop the averaged embeddings. We trained the model on binary targets derived from two signals: whether the return $r_{-90d}$ (Equation [1](https://arxiv.org/html/2606.13100#S3.E1)) falls in the top 10% or in the bottom 10%, and likewise for the EPS Surprise (Equation [2](https://arxiv.org/html/2606.13100#S3.E2)), each framed as identifying the top or bottom decile (i.e. a 10% prevalence). Table [4](https://arxiv.org/html/2606.13100#S5.T4) reports PR-AUC (5-fold stratified CV) for ten encoders against a baseline with Multinomial Naive Bayes on a bag-of-words. Several encoder/target combinations land well above the 0.10 random baseline (e.g. 0.47 for positive EPS surprises and 0.44 for negative 90-day returns), indicating a genuine, if modest, textual signal and illustrating the kind of cross-modal study (narrative text → realized financials) our corpus enables.
Table 4. PR-AUC when predicting EPS surprise and $r_{-90d}$ from the CEO letter (top and bottom deciles); best per column in bold.
## 6. Availability, license, and ethics
Availability. The corpus, KPI tables, qrels, benchmark splits, model predictions, and the complete fetch/extract/validate/score toolchain are released under the MIT license (for all code) and CC-BY-4.0 (for all data). All datasets are publicly available on HuggingFace ([https://huggingface.co/collections/artefactory/ledger](https://huggingface.co/collections/artefactory/ledger)) and all code is available in our public GitHub repository ([https://github.com/artefactory/LEDGER](https://github.com/artefactory/LEDGER)). The release includes datasheets (Gebru et al., [2021](https://arxiv.org/html/2606.13100#bib.bib8)) and FAIR-compliant (Wilkinson et al., [2016](https://arxiv.org/html/2606.13100#bib.bib31)) metadata, and reuses the standard TREC `qrels` format so existing IR tooling (e.g. `trec_eval`) applies directly.
Ethics and limitations. All documents are public regulatory filings and investor reports; no personal data is involved. KPI ground truth derives from official XBRL filings and third-party financial APIs, and may contain restatements or vendor errors; we therefore publish provenance (source and exact tag) for every value and a human OCR-quality audit (Section [3](https://arxiv.org/html/2606.13100#S3)), but the data is provided for research and is *not* investment advice. OCR introduces errors in dense tables, which the annotation layer quantifies. The benchmark is U.S.-centric and bounded to 2017–2022 for market-reaction data (no EDGAR equivalent for non-U.S. listings); we intend to extend both the window and the company universe.
Conclusion. Ledger turns 4,999 real annual reports into a corpus of labeled documents and an evaluation suite spanning retrieval, conversational and comprehensive long-context extraction. Strong off-the-shelf systems leave clear headroom on every task, and the released toolchain lets the community extend the resource along the time, industry, country and modality axes.
## References
- Auer et al\.\(2007\)Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives\. 2007\.Dbpedia: A nucleus for a web of open data\. In*international semantic web conference*\. Springer, 722–735\.
- Boizard et al\.\(2025\)Nicolas Boizard, Hippolyte Gisserot\-Boukhlef, Duarte M Alves, André Martins, Ayoub Hammal, Caio Corro, Céline Hudelot, Emmanuel Malherbe, Etienne Malaboeuf, Fanny Jourdan, et al\.2025\.Eurobert: Scaling multilingual encoders for european languages\.*arXiv preprint arXiv:2503\.05500*\(2025\)\.
- Chen et al\.\(2024\)Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu\. 2024\.BGE M3\-Embedding: Multi\-Lingual, Multi\-Functionality, Multi\-Granularity Text Embeddings Through Self\-Knowledge Distillation\.arXiv:2402\.03216 \[cs\.CL\]
- Chen et al\.\(2021\)Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dillon Langdon, Reema Moussa, Matt Beane, Ting\-Hao Huang, Bryan Routledge, and William Yang Wang\. 2021\.FinQA: A Dataset of Numerical Reasoning over Financial Data\. In*Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*\.
- Chen et al\.\(2022\)Zhiyu Chen, Shiyang Li, Charese Smiley, Zhiqiang Ma, Sameena Shah, and William Yang Wang\. 2022\.ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering\. In*Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*\.
- Formal et al\.\(2021\)Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant\. 2021\.SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking\. In*Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval*\.
- Gebru et al\.\(2021\)Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford\. 2021\.Datasheets for Datasets\.*Commun\. ACM*64, 12 \(2021\), 86–92\.
- Google DeepMind \(2026\)Google DeepMind\. 2026\.Gemini 3\.1 Pro Model Card\.[https://storage\.googleapis\.com/deepmind\-media/Model\-Cards/Gemini\-3\-1\-Pro\-Model\-Card\.pdf](https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf)\.Accessed: 2026\-06\-01\.
- Hsieh et al\.\(2024\)Cheng\-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, and Boris Ginsburg\. 2024\.RULER: What’s the Real Context Size of Your Long\-Context Language Models?\. In*First Conference on Language Modeling*\.[https://openreview\.net/forum?id=kIoBbc76Sy](https://openreview.net/forum?id=kIoBbc76Sy)
- Islam et al\.\(2023\)Pranab Islam, Anand Kannappan, Douwe Kiela, Rebecca Qian, Nino Scherrer, and Bertie Vidgen\. 2023\.FinanceBench: A New Benchmark for Financial Question Answering\.arXiv:2311\.11944 \[cs\.CL\]
- Jimeno Yepes et al\.\(2024\)Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, and Renyu Li\. 2024\.Financial Report Chunking for Effective Retrieval Augmented Generation\.arXiv:2402\.05131 \[cs\.CL\]
- Khattab and Zaharia \(2020\)Omar Khattab and Matei Zaharia\. 2020\.ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT\. In*Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval*\. 39–48\.
- Liu et al\.\(2026\)Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al\.2026\.Ministral 3\.*arXiv preprint arXiv:2601\.08584*\(2026\)\.
- Liu et al\.\(2024\)Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang\. 2024\.Lost in the middle: How language models use long contexts\.*Transactions of the association for computational linguistics*12 \(2024\), 157–173\.
- Liu et al\.\(2019\)Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov\. 2019\.Roberta: A robustly optimized bert pretraining approach\.*arXiv preprint arXiv:1907\.11692*\(2019\)\.
- Loison et al\.\(2026\)António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, and Gautier Viaud\. 2026\.ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real\-World Scenarios\.*arXiv preprint arXiv:2601\.08620*\(2026\)\.
- Malherbe and Aufaure \(2016\)Emmanuel Malherbe and Marie\-Aude Aufaure\. 2016\.Bridge the terminology gap between recruiters and candidates: A multilingual skills base built from social media and linked data\. In*2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining \(ASONAM\)*\. IEEE, 583–590\.
- NVIDIA \(2025\)NVIDIA\. 2025\.Nemotron 3 Nano: Open, Efficient Mixture\-of\-Experts Hybrid Mamba\-Transformer Model for Agentic Reasoning\.[https://arxiv\.org/abs/2512\.20848](https://arxiv.org/abs/2512.20848)Technical report\.
- OpenAI \(2025\)OpenAI\. 2025\.gpt\-oss\-120b & gpt\-oss\-20b Model Card\.arXiv:2508\.10925 \[cs\.CL\][https://arxiv\.org/abs/2508\.10925](https://arxiv.org/abs/2508.10925)
- Qwen Team \(2026\)Qwen Team\. 2026\.Qwen3\.6\-27B: Flagship\-Level Coding in a 27B Dense Model\.[https://qwen\.ai/blog?id=qwen3\.6\-27b](https://qwen.ai/blog?id=qwen3.6-27b)
- Reddy et al\.\(2024\)Varshini Reddy, Rik Koncel\-Kedziorski, Viet Dac Lai, Michael Krumdick, Charles Lovering, and Chris Tanner\. 2024\.DocFinQA: A Long\-Context Financial Reasoning Dataset\. In*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\), Short Papers*\.
- Reimers and Gurevych \(2019\)Nils Reimers and Iryna Gurevych\. 2019\.Sentence\-BERT: Sentence Embeddings using Siamese BERT\-Networks\. In*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing*\. Association for Computational Linguistics\.[https://arxiv\.org/abs/1908\.10084](https://arxiv.org/abs/1908.10084)
- Robertson and Zaragoza \(2009\)Stephen Robertson and Hugo Zaragoza\. 2009\.The Probabilistic Relevance Framework: BM25 and Beyond\.*Foundations and Trends in Information Retrieval*3, 4 \(2009\), 333–389\.[doi:10\.1561/1500000019](https://doi.org/10.1561/1500000019)
- Song et al\.\(2020\)Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie\-Yan Liu\. 2020\.Mpnet: Masked and permuted pre\-training for language understanding\.*Advances in neural information processing systems*33 \(2020\), 16857–16867\.
- Vera et al\.\(2025\)Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, et al\.2025\.Embeddinggemma: Powerful and lightweight text representations\.*arXiv preprint arXiv:2509\.20354*\(2025\)\.
- Voorhees et al\.\(2005\)Ellen M Voorhees, Donna K Harman, et al\.2005\.*TREC: Experiment and evaluation in information retrieval*\. Vol\. 63\.MIT press Cambridge\.
- Warner et al\.\(2024\)Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, and Iacopo Poli\. 2024\.Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference\.arXiv:2412\.13663 \[cs\.CL\][https://arxiv\.org/abs/2412\.13663](https://arxiv.org/abs/2412.13663)
- Wei et al\.\(2025\)Haoran Wei, Yaofeng Sun, and Yukun Li\. 2025\.DeepSeek\-OCR: Contexts Optical Compression\.*arXiv preprint arXiv:2510\.18234*\(2025\)\.
- Wei et al\.\(2026\)Haoran Wei, Yaofeng Sun, and Yukun Li\. 2026\.DeepSeek\-OCR 2: Visual Causal Flow\.*arXiv preprint arXiv:2601\.20552*\(2026\)\.
- Wilkinson et al\.\(2016\)Mark D\. Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, et al\.2016\.The FAIR Guiding Principles for Scientific Data Management and Stewardship\.*Scientific Data*3, 1 \(2016\), 160018\.
- Wu et al\.\(2023\)Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann\. 2023\.BloombergGPT: A Large Language Model for Finance\.arXiv:2303\.17564 \[cs\.LG\]
- Xiao et al\.\(2023\)Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff\. 2023\.C\-Pack: Packaged Resources To Advance General Chinese Embedding\.arXiv:2309\.07597 \[cs\.CL\]
- Zhu et al\.\(2021\)Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat\-Seng Chua\. 2021\.TAT\-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance\. In*Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics \(ACL\)*\.