SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion
Summary
SmolDocling is a compact 256M-parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces DocTags, a new universal markup format that captures page elements together with their locations, and matches the performance of models up to 27 times larger.
Source: https://huggingface.co/papers/2503.11576
Abstract
SmolDocling is a compact 256M-parameter vision-language model that performs robust end-to-end document conversion across diverse document types using a new markup format.
We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers an end-to-end conversion for accurately capturing content, structure and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms -- significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for charts, tables, equations, and code recognition. Experimental results demonstrate that SmolDocling competes with other Vision Language Models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; datasets will be publicly available soon.
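The preview checkpoint listed below (docling-project/SmolDocling-256M-preview) can be driven through the transformers library. The following is a minimal sketch, assuming the Idefics3-style chat interface described on the model card; the prompt wording, input file name, and generation settings are illustrative, not official:

```python
# Minimal sketch: convert one rendered page image to DocTags with SmolDocling.
# Assumes the Idefics3-style chat interface of the preview checkpoint.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

device = "cuda" if torch.cuda.is_available() else "cpu"
repo = "docling-project/SmolDocling-256M-preview"

processor = AutoProcessor.from_pretrained(repo)
model = AutoModelForVision2Seq.from_pretrained(repo, torch_dtype=torch.bfloat16).to(device)

# Any rendered document page works here; "page.png" is a placeholder.
image = load_image("page.png")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(device)

generated = model.generate(**inputs, max_new_tokens=8192)
# Keep special tokens: the DocTags markup (element tags plus <loc_*> tokens) lives there.
doctags = processor.batch_decode(
    generated[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
)[0]
print(doctags)
```

The decoded string is raw DocTags; the companion docling-core package provides utilities for parsing it into a document object that can be exported to formats such as Markdown or HTML.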
Get this paper in your agent:
hf papers read 2503.11576
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
## Models citing this paper (16)
#### docling-project/SmolDocling-256M-preview
Image-Text-to-Text • Updated Sep 17, 2025 • 29.8k • 1.61k
#### ibm-granite/granite-docling-258M
Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 283k • 1.17k
#### docling-project/CodeFormulaV2
0.3B • Updated Aug 11, 2025 • 68.4k • 4
#### prithivMLmods/granite-docling-258M-f32-GGUF
Image-Text-to-Text • 0.2B • Updated Nov 12, 2025 • 184 • 3
Browse 16 models citing this paper
## Datasets citing this paper (7)
#### mnezama/SynthCodeNet
Viewer • Updated Jan 28 • 9.33M • 5.04k
#### docling-project/SynthCodeNet
Viewer • Updated Jul 16, 2025 • 9.33M • 3.45k • 13
#### HuggingFaceM4/DoclingMatix
Viewer • Updated Jul 31, 2025 • 1.27M • 1.15k • 51
#### docling-project/SynthFormulaNet
Viewer • Updated Jul 31, 2025 • 6.45M • 967 • 17
Browse 7 datasets citing this paper
### Spaces citing this paper (24)
### Collections including this paper (45)
Similar Articles
@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…
dots.ocr is a new lightweight 1.7B parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) at document parsing and OCR tasks.
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
PaddleOCR-VL is a compact 0.9B vision-language model that achieves state-of-the-art performance in multilingual document parsing and element recognition by integrating NaViT-style dynamic resolution with the ERNIE language model.
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.
@tom_doerr: Converts images and PDFs to Markdown without OCR https://github.com/NanoNets/docext
docext is an on-premises toolkit that converts images and PDFs to markdown without OCR, leveraging vision-language models. It also introduces Nanonets-OCR-s, a compact 3B parameter model for efficient image-to-markdown conversion.
@cjzafir: VLMs (Vertical Language Models) are beating top LLMs. These small 7B to 15B niche-focused models are beating SoTA model…
The author demonstrates that small vertical language models (6B-15B) can outperform top LLMs on niche benchmarks through cost-effective fine-tuning using open-source models and Codex orchestration, achieving results with a $300 dataset.