olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models
Summary
olmOCR is an open-source toolkit using a fine-tuned vision language model to extract clean text from PDFs while preserving structure, optimized for large-scale batch processing.
Source: https://huggingface.co/papers/2502.18443 Published on Feb 25, 2025
Abstract
olmOCR is an open-source toolkit using a fine-tuned vision language model to process PDFs into clean text while preserving structure, optimized for large-scale batch processing.
PDF documents have the potential to provide trillions of novel, high-quality tokens for training language models. However, these documents come in a diversity of types with differing formats and visual layouts that pose a challenge when attempting to extract and faithfully represent the underlying content for language model use. We present olmOCR, an open-source Python toolkit for processing PDFs into clean, linearized plain text in natural reading order while preserving structured content like sections, tables, lists, equations, and more. Our toolkit runs a fine-tuned 7B vision language model (VLM) trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties, including graphics, handwritten text and poor quality scans. olmOCR is optimized for large-scale batch processing, able to scale flexibly to different hardware setups and convert a million PDF pages for only $190 USD. We release all components of olmOCR including VLM weights, data and training code, as well as inference code built on serving frameworks including vLLM and SGLang.
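To put the quoted cost in perspective, the following is a minimal sketch that extrapolates the abstract's figure of $190 USD per million pages to other corpus sizes. It assumes cost scales linearly with page count, which the abstract does not state; actual cost will depend on hardware and document mix. The function name `estimate_cost_usd` is hypothetical and is not part of the olmOCR toolkit.

```python
# Figures quoted in the abstract: one million PDF pages for $190 USD.
QUOTED_COST_USD = 190.0
QUOTED_PAGES = 1_000_000


def estimate_cost_usd(num_pages: int) -> float:
    """Extrapolate the quoted cost to num_pages.

    Assumption (not from the paper): cost is linear in the number of pages.
    """
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * QUOTED_COST_USD / QUOTED_PAGES


if __name__ == "__main__":
    # 260,000 is the size of the training sample mentioned in the abstract;
    # the other values are arbitrary illustrations.
    for pages in (1_000, 260_000, 1_000_000, 100_000_000):
        print(f"{pages:>11,} pages: about ${estimate_cost_usd(pages):,.2f}")
```

Under this linear assumption the quoted price works out to $0.00019 per page, that is, roughly 5,263 pages per dollar.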