@rohanpaul_ai: This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logi…
Summary
Researchers from Stanford, UC, and Nanjing University release SEFD, a dataset of 152B tokens from SEC filings converted to layout-faithful MultiMarkdown, preserving table structure for LLM training with minimal overlap with Common Crawl.
Cached at: 06/17/26, 11:52 AM
This was long needed for AI in finance.
Making SEC filings readable for machines without flattening the accounting logic.
Stanford + Univ of Calif + Nanjing Univ researchers have just released a dataset and methods that offer a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables.
They release a 152B-token public snapshot and estimate that the full archive could reach about 550B tokens of long financial documents.
The dataset has less than 0.1% overlap with Common Crawl-derived corpora.
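The post does not say how the overlap figure was measured. One common way to estimate overlap between corpora is the share of a document's word n-grams that also appear in a reference corpus. The sketch below is only an illustration under that assumption; the function names `ngrams` and `overlap_fraction`, the whitespace tokenization, and the default window of 8 words are all hypothetical choices, not the authors' method.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(doc: str, reference_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of doc's distinct n-grams that also occur in any reference document."""
    doc_grams = ngrams(doc, n)
    if not doc_grams:
        return 0.0
    ref_grams: Set[Tuple[str, ...]] = set()
    for ref in reference_docs:
        ref_grams |= ngrams(ref, n)
    return len(doc_grams & ref_grams) / len(doc_grams)


if __name__ == "__main__":
    # Toy strings, not real filing text.
    filing = "total revenue increased due to higher product sales"
    web = ["total revenue increased due to lower costs"]
    print(overlap_fraction(filing, web, n=3))
```

At corpus scale an exact set of n-grams would not fit in memory, so a real measurement would more likely rely on hashing or sketching; the toy version only shows what quantity is being compared.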
The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.
The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens.
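To make "layout-faithful" concrete, here is a minimal sketch of how a table with a merged header could be written out in MultiMarkdown, where a cell that spans several columns is closed by that many pipes. This is not the SEFD pipeline: the `Cell` type, `render_row`, `render_table`, the use of `&nbsp;` to carry indentation, and the sample figures are all assumptions made up for illustration.

```python
from typing import List, Tuple

# Hypothetical cell representation: (text, number of columns the cell spans).
Cell = Tuple[str, int]


def render_row(cells: List[Cell]) -> str:
    """Render one row; a cell spanning k columns is closed by k pipes."""
    return "|" + "".join(f" {text} " + "|" * span for text, span in cells)


def render_table(header_rows: List[List[Cell]],
                 body_rows: List[List[Cell]],
                 n_cols: int) -> str:
    """Render header rows, a separator line, then body rows."""
    lines = [render_row(row) for row in header_rows]
    lines.append("|" + " --- |" * n_cols)
    lines += [render_row(row) for row in body_rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers; parentheses mark negatives, as in accounting tables.
    header = [
        [("", 1), ("Year ended December 31", 2)],  # merged header over two columns
        [("", 1), ("2024", 1), ("2023", 1)],
    ]
    body = [
        [("Revenue", 1), ("1,200", 1), ("1,000", 1)],
        [("&nbsp;&nbsp;Product", 1), ("900", 1), ("800", 1)],  # indented sub-item
        [("Net loss", 1), ("(50)", 1), ("(75)", 1)],
    ]
    print(render_table(header, body, n_cols=3))
```

The point of the sketch is that the merged header, the indentation of the sub-item, and the parenthesized negatives each survive as a few characters, rather than as the nested HTML attributes a filing would normally carry.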
Link – arxiv.org/abs/2606.18192v1
Similar Articles
A modest proposal: Reformat everything to make documents more palatable to AI (5 minute read)
The LF AI & Data Foundation has formed a working group to develop DocLang, an AI-friendly document format backed by IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis, aiming to solve the problem of existing formats like PDF and HTML being ill-suited for AI parsing.
From data to decisions: how LSEG is scaling trusted AI
LSEG partners with OpenAI to scale trusted AI by grounding models in high-quality financial data, shrinking release cycles from months to two weeks, and expanding the analyst role from data processing to deeper insight.
@itsharmanjot: A TEAM OF AI RESEARCHERS JUST OPEN-SOURCED THE BLOOMBERG TERMINAL FOR QUANT FINANCE. A Bloomberg Terminal costs $25,000…
QuantMind, an open-source framework that ingests financial research papers, news, and SEC filings into a searchable knowledge graph, has been released on GitHub and accepted to NeurIPS 2025's GenAI in Finance Workshop, offering a free alternative to the Bloomberg Terminal.
@jerryjliu0: We built an AI agent for due diligence, with exact audit trails back to the source page, that you can use as a template…
LlamaIndex's Jerry Liu demonstrates building a financial due diligence AI agent with LiteParse, a free open-source PDF parser that provides exact citations and bounding box coordinates, enabling trust and transparency in agentic workflows.
@jerryjliu0: Many AI agents in finance rely on extremely high quality context engineering from documents. They can be roughly divided…
Jerry Liu discusses how AI agents in finance rely on high-quality context engineering from documents, covering use cases like invoice processing and equity research, and shares workshop slides and a repository for building document parsing pipelines with human-in-the-loop review.