SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Papers with Code Trending 03/14/25, 04:44 PM Papers

vision-language-model document-processing small-language-models ocr open-source hugging-face

Summary

SmolDocling is a compact 256M-parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces DocTags, a new universal markup format that captures page elements together with their location, and it competes with vision-language models up to 27 times larger.
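To put the size comparison in concrete terms, the following Python sketch works out what "27 times larger" implies for a 256M-parameter model. The parameter count and the ratio come from the summary above; the 2-bytes-per-parameter figure assumes 16-bit weights, which is an illustrative assumption and not something the source states. The function name is hypothetical.

```python
# Back-of-the-envelope size comparison (illustrative only).
SMOLDOCLING_PARAMS = 256_000_000  # 256M parameters, as stated in the summary
SIZE_RATIO = 27                   # "27 times larger", as stated in the summary
BYTES_PER_PARAM = 2               # ASSUMPTION: 16-bit weights (not from the source)


def weight_footprint_gb(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate size of the raw weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter count of a model 27 times the size of SmolDocling.
larger_model_params = SMOLDOCLING_PARAMS * SIZE_RATIO

print(f"Largest compared model: ~{larger_model_params / 1e9:.1f}B parameters")
print(f"SmolDocling weights:    ~{weight_footprint_gb(SMOLDOCLING_PARAMS):.2f} GB")
print(f"Larger model weights:   ~{weight_footprint_gb(larger_model_params):.1f} GB")
```

Under these assumptions the largest compared model has roughly 6.9B parameters, and the raw weights come to about 0.51 GB versus about 13.8 GB. Actual memory use at inference time also depends on activations and runtime overhead, which this estimate ignores.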

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal markup format that captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models, or ensemble solutions that rely on handcrafted pipelines of multiple specialized models, SmolDocling offers end-to-end conversion that accurately captures the content, structure, and spatial location of document elements in a 256M-parameter vision-language model. SmolDocling exhibits robust performance in correctly reproducing document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types including business documents, academic papers, technical reports, patents, and forms, significantly extending beyond the commonly observed focus on scientific papers. Additionally, we contribute novel publicly sourced datasets for chart, table, equation, and code recognition. Experimental results demonstrate that SmolDocling competes with other vision-language models that are up to 27 times larger in size, while reducing computational requirements substantially. The model is currently available; the datasets will be publicly available soon.
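The idea of a markup that carries each page element together with its location can be illustrated with a small Python sketch. This is not the DocTags specification: the tag names, the `<loc_N>` token form, the 0 to 500 coordinate grid, and every function and class name below are hypothetical choices made only to show how content, element type, and bounding box might be serialized into one sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple

GRID = 500  # ASSUMPTION: coordinates are quantized to a 0..GRID grid (illustrative value)


@dataclass
class PageElement:
    kind: str                                # hypothetical tag name, e.g. "text", "table", "code"
    bbox: Tuple[float, float, float, float]  # (left, top, right, bottom) in page pixels
    content: str


def quantize(value: float, extent: float, grid: int = GRID) -> int:
    """Map a pixel coordinate to a grid position that does not depend on page size."""
    return min(grid, max(0, round(value / extent * grid)))


def to_markup(el: PageElement, page_w: float, page_h: float) -> str:
    """Serialize one element as <kind><loc_..>x4 content </kind> (hypothetical syntax)."""
    left, top, right, bottom = el.bbox
    locs = [
        quantize(left, page_w),
        quantize(top, page_h),
        quantize(right, page_w),
        quantize(bottom, page_h),
    ]
    loc_tokens = "".join(f"<loc_{v}>" for v in locs)
    return f"<{el.kind}>{loc_tokens}{el.content}</{el.kind}>"


def page_to_markup(elements: List[PageElement], page_w: float, page_h: float) -> str:
    """Emit all elements of a page, ordered top-to-bottom then left-to-right."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    return "\n".join(to_markup(e, page_w, page_h) for e in ordered)


heading = PageElement("text", (100, 200, 900, 260), "Introduction")
print(to_markup(heading, page_w=1000, page_h=2000))
# prints: <text><loc_50><loc_50><loc_450><loc_65>Introduction</text>
```

Quantizing coordinates to a fixed grid keeps the location vocabulary small and independent of the page resolution, which is one plausible reason to encode positions as discrete tokens. The sketch does not escape markup characters inside `content`, and the simple sort is only a stand-in for real reading-order detection.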


Cached at: 05/08/26, 08:43 AM


Source: https://huggingface.co/papers/2503.11576

Abstract

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.



Get this paper in your agent:

hf papers read 2503.11576

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 16

#### docling-project/SmolDocling-256M-preview
Image-Text-to-Text • Updated Sep 17, 2025 • 29.8k • 1.61k

#### ibm-granite/granite-docling-258M
Image-Text-to-Text • 0.3B • Updated Sep 23, 2025 • 283k • 1.17k

#### docling-project/CodeFormulaV2
0.3B • Updated Aug 11, 2025 • 68.4k • 4

#### prithivMLmods/granite-docling-258M-f32-GGUF
Image-Text-to-Text • 0.2B • Updated Nov 12, 2025 • 184 • 3

Datasets citing this paper: 7

#### mnezama/SynthCodeNet
Updated Jan 28 • 9.33M • 5.04k

#### docling-project/SynthCodeNet
Updated Jul 16, 2025 • 9.33M • 3.45k • 13

#### HuggingFaceM4/DoclingMatix
Updated Jul 31, 2025 • 1.27M • 1.15k • 51

#### docling-project/SynthFormulaNet
Updated Jul 31, 2025 • 6.45M • 967 • 17

Spaces citing this paper: 24

Collections including this paper: 45


Similar Articles

@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

@tom_doerr: Converts images and PDFs to Markdown without OCR https://github.com/NanoNets/docext

@cjzafir: VLMs (Vertical Language Models) are beating top LLMs. These small 7B to 15B niche-focused models are beating SoTA model…
