@rohanpaul_ai: This was long needed for AI in finance. Making SEC filings readable for machines without flattening the accounting logi…

X AI KOLs Following 06/17/26, 11:16 AM Papers

finance sec-filings llm-training dataset edgar financial-data table-structure

Summary

Researchers from Stanford, UC, and Nanjing University release SEFD, a dataset of 152B tokens from SEC filings converted to layout-faithful MultiMarkdown, preserving table structure for LLM training with minimal overlap with Common Crawl.


Original Article


Cached at: 06/17/26, 11:52 AM

This was long needed for AI in finance.

Making SEC filings readable for machines without flattening the accounting logic.

Researchers from Stanford, the University of California, and Nanjing University have just released a dataset and methods that offer a cleaner way to turn SEC filings into useful LLM training data without losing the meaning inside financial tables.

They release a 152B-token public snapshot and estimate that the full archive could reach about 550B tokens of long financial documents.

The dataset has less than 0.1% overlap with Common Crawl-derived corpora.
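For a rough sense of what an overlap figure like this can mean, here is a minimal sketch that measures the share of a document's word 5-grams that also appear in a reference corpus. The post does not say how the authors measured overlap, so the 5-gram choice, the whitespace tokenization, and the function names `ngrams` and `overlap_fraction` are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(doc: str, reference_docs: Iterable[str], n: int = 5) -> float:
    """Fraction of the document's n-grams that also occur in the reference docs.

    Hypothetical metric for illustration; not the authors' stated method.
    """
    doc_grams = ngrams(doc, n)
    if not doc_grams:
        return 0.0
    reference_grams: Set[Tuple[str, ...]] = set()
    for ref in reference_docs:
        reference_grams |= ngrams(ref, n)
    return len(doc_grams & reference_grams) / len(doc_grams)


if __name__ == "__main__":
    filing = "total revenue increased during the fiscal year"
    web_page = "total revenue increased during the quarter"
    print(overlap_fraction(filing, [web_page]))
```

At corpus scale this would normally be done with hashed n-grams or a Bloom filter rather than in-memory sets, but the quantity being reported is the same idea.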

The authors propose SEFD, a rebuilt version of EDGAR filings that keeps table structure, indentation, and financial meaning while using fewer tokens for LLM training.

The dataset turns EDGAR into layout-faithful MultiMarkdown, preserving merged headers, indentation, signs, spans, and table hierarchy while shrinking enormous presentation scaffolding into usable tokens.
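To make "preserving merged headers and spans" concrete, here is a small sketch that renders a table with a two-column header span using MultiMarkdown table syntax, where a cell that spans n columns is closed by n consecutive pipes. The helper names `render_row` and `render_table` and the figures in the example are hypothetical; this is not SEFD's converter, only an illustration of the target format.

```python
from typing import List, Tuple

Cell = Tuple[str, int]  # (cell text, number of columns the cell spans)


def render_row(cells: List[Cell]) -> str:
    """Render one table row; a cell spanning n columns ends with n pipes."""
    out = "|"
    for text, span in cells:
        out += f" {text} " + "|" * span
    return out


def render_table(header_rows: List[List[Cell]],
                 body_rows: List[List[Cell]],
                 n_cols: int) -> str:
    """Render header rows, a separator line, then body rows."""
    lines = [render_row(row) for row in header_rows]
    lines.append("|" + " --- |" * n_cols)
    lines += [render_row(row) for row in body_rows]
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up numbers; parentheses mark negative values, as in filings.
    header = [
        [("", 1), ("Year ended December 31", 2)],  # merged header over 2 columns
        [("", 1), ("2024", 1), ("2023", 1)],
    ]
    body = [
        [("Revenue", 1), ("5,000", 1), ("4,200", 1)],
        [("Net loss", 1), ("(1,200)", 1), ("(900)", 1)],
    ]
    print(render_table(header, body, n_cols=3))
```

Flattening the same table to plain text would lose the link between "Year ended December 31" and the two year columns, which is the kind of accounting structure the post says the dataset keeps.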

Link: arxiv.org/abs/2606.18192v1


Similar Articles

A modest proposal: Reformat everything to make documents more palatable to AI (5 minute read)

From data to decisions: how LSEG is scaling trusted AI

@itsharmanjot: A TEAM OF AI RESEARCHERS JUST OPEN-SOURCED THE BLOOMBERG TERMINAL FOR QUANT FINANCE. A Bloomberg Terminal costs $25,000…

@jerryjliu0: We built an AI agent for due diligence, with exact audit trails back to the source page, that you can use as a template…

@jerryjliu0: Many AI agents in finance rely on extremely high quality context engineering from documents They can be roughly divided…
