SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Summary
SmolDocling is a compact 256M-parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces DocTags, a new universal markup format that captures page elements together with their location, and it competes with vision-language models up to 27 times larger.
Source: https://huggingface.co/papers/2503.11576
Abstract
SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M parameters vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available, datasets will be publicly available soon.
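To give a rough sense of scale for the "up to 27 times larger" comparison, the sketch below multiplies the stated parameter count by that ratio. The helper name `largest_compared_size` is hypothetical, "256M" is assumed to mean exactly 256,000,000 parameters, and the result is only the upper bound implied by the abstract's wording, not a figure reported on this page.

```python
def largest_compared_size(base_params: int, ratio: int) -> int:
    """Return the parameter count of a model `ratio` times larger than `base_params`."""
    return base_params * ratio


# Assumption: "256M parameters" is treated as a nominal 256,000,000.
SMOLDOCLING_PARAMS = 256_000_000
# "up to 27 times larger", as stated in the abstract.
MAX_RATIO = 27

upper_bound = largest_compared_size(SMOLDOCLING_PARAMS, MAX_RATIO)
print(f"{upper_bound / 1e9:.2f}B parameters")  # prints "6.91B parameters"
```

Under these assumptions, the largest models in the comparison would sit at roughly 7B parameters.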
Get this paper in your agent:
hf papers read 2503.11576
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 16
docling-project/SmolDocling-256M-preview • Image-Text-to-Text • Updated Sep 17, 2025 • 29.8k • 1.61k
ibm-granite/granite-docling-258M • Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 283k • 1.17k
docling-project/CodeFormulaV2 • 0.3B • Updated Aug 11, 2025 • 68.4k • 4
prithivMLmods/granite-docling-258M-f32-GGUF • Image-Text-to-Text • 0.2B • Updated Nov 12, 2025 • 184 • 3
Datasets citing this paper: 7
mnezama/SynthCodeNet • Viewer • Updated Jan 28 • 9.33M • 5.04k
docling-project/SynthCodeNet • Viewer • Updated Jul 16, 2025 • 9.33M • 3.45k • 13
HuggingFaceM4/DoclingMatix • Viewer • Updated Jul 31, 2025 • 1.27M • 1.15k • 51
docling-project/SynthFormulaNet • Viewer • Updated Jul 31, 2025 • 6.45M • 967 • 17
Spaces citing this paper: 24
Collections including this paper: 45
Similar Articles
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
This paper presents dots.ocr, a unified Vision-Language Model that jointly learns layout detection, text recognition, and relational understanding for multilingual document layout parsing. It achieves state-of-the-art results on OmniDocBench and introduces the XDocParse benchmark spanning 126 languages.
@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…
dots.ocr is a new lightweight 1.7B parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) at document parsing and OCR tasks.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
PaddleOCR-VL is a compact 0.9B vision-language model that achieves state-of-the-art performance in multilingual document parsing and element recognition by integrating NaViT-style dynamic resolution with the ERNIE language model.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.
@oliviscusAI: You can now parse any document with one 1.7B parameter model. It’s called dots-ocr. One system that handles text, tables…
The article introduces dots-ocr, a 1.7B parameter model capable of parsing text, tables, formulas, and images from documents in over 100 languages without needing separate OCR pipelines.