MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
Summary
MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.
Source: https://huggingface.co/papers/2509.22186 Published on Sep 26, 2025
Submitted by taesiri (https://huggingface.co/taesiri) on Sep 29, 2025
#2 Paper of the day
Abstract
MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.
We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.
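The geometry behind the two stages can be sketched in a few lines. This is an illustrative sketch only, not the MinerU2.5 implementation: the names `Box`, `downsample_size`, `to_native`, and `crop` are hypothetical, the `max_side` limit is an assumed parameter that the abstract does not specify, and the layout boxes are supplied by hand where a real system would get them from the model. It shows how a box found on the downsampled page could be mapped back to native-resolution pixels before cropping.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (x0, y0 inclusive; x1, y1 exclusive)."""
    x0: float
    y0: float
    x1: float
    y1: float


def downsample_size(width, height, max_side):
    """Size of the stage-1 thumbnail; max_side is an assumed limit, not a value from the paper."""
    scale = min(1.0, max_side / max(width, height))
    return max(1, round(width * scale)), max(1, round(height * scale))


def to_native(box, thumb_size, native_size):
    """Map a box detected on the thumbnail back to native-resolution pixel coordinates."""
    sx = native_size[0] / thumb_size[0]
    sy = native_size[1] / thumb_size[1]
    # Round outwards and clamp so the crop never loses pixels at the box edges.
    x0 = max(0, math.floor(box.x0 * sx))
    y0 = max(0, math.floor(box.y0 * sy))
    x1 = min(native_size[0], math.ceil(box.x1 * sx))
    y1 = min(native_size[1], math.ceil(box.y1 * sy))
    return (x0, y0, x1, y1)


def crop(image_rows, region):
    """Stage-2 input: cut the region out of the full-resolution image (a list of pixel rows)."""
    x0, y0, x1, y1 = region
    return [row[x0:x1] for row in image_rows[y0:y1]]


if __name__ == "__main__":
    native = (4000, 2000)                        # width, height of the original page
    thumb = downsample_size(*native, max_side=1000)
    layout_box = Box(100, 50, 300, 150)          # hand-written stand-in for a detected element
    print(thumb, to_native(layout_box, thumb, native))
```

In this sketch, layout analysis only ever sees the thumbnail, while recognition only ever sees the cropped regions, so no step has to process the whole page at full resolution.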
Get this paper in your agent:
hf papers read 2509.22186
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 6
opendatalab/MinerU2.5-2509-1.2B • Image-Text-to-Text • 1B • Updated 29 days ago • 1.49M • 356
opendatalab/MinerU-Diffusion-V1-0320-2.5B • Image-to-Text • 3B • Updated Mar 25 • 29.5k • 22
freakynit/MinerU2.5-2509-1.2B • Image-Text-to-Text • 1B • Updated Oct 15, 2025 • 7
Mungert/MinerU2.5-2509-1.2B-GGUF • Image-Text-to-Text • 0.5B • Updated Oct 20, 2025 • 1.91k
Datasets citing this paper: 0
No dataset linking this paper
Spaces citing this paper: 13
Collections including this paper: 22
Similar Articles
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
PaddleOCR-VL is a compact 0.9B vision-language model that achieves state-of-the-art performance in multilingual document parsing and element recognition by integrating NaViT-style dynamic resolution with the ERNIE language model.
opendatalab/MinerU
MinerU is an open-source tool by OpenDataLab for extracting data from PDFs and documents.
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
SmolDocling is a compact 256M-parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces a new universal markup format called DocTags that captures page elements together with their locations, and it competes with models 27 times larger.
baidu/Unlimited-OCR
Baidu releases Unlimited-OCR, a new model for one-shot long-horizon document parsing, building on DeepSeek-OCR. It supports single-image and multi-page/PDF parsing via Hugging Face Transformers and SGLang.
@oliviscusAI: You can now parse any document with one 1.7B parameter model It’s called dots-ocr. One system that handles text, tables…
The article introduces dots-ocr, a 1.7B-parameter model capable of parsing text, tables, formulas, and images from documents in over 100 languages without needing separate OCR pipelines.