docext is an on-premises toolkit that converts images and PDFs to markdown using vision-language models rather than traditional OCR. It also introduces Nanonets-OCR-s, a compact 3B-parameter model for efficient image-to-markdown conversion.
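When a VLM is prompted to emit markdown, the raw generation often arrives wrapped in a code fence that must be stripped before use. The helper below is a minimal post-processing sketch of that step; it is illustrative and not part of docext's actual API.

```python
# Hypothetical post-processing helper for VLM image-to-markdown output.
# Models frequently wrap their answer in a ```markdown fence; this strips
# the fence while leaving already-clean output untouched.
def extract_markdown(raw: str) -> str:
    text = raw.strip()
    if text.startswith("```"):
        lines = text.splitlines()
        # drop the opening fence line (e.g. ```markdown)
        lines = lines[1:]
        # drop a trailing closing fence if present
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        text = "\n".join(lines)
    return text.strip()

print(extract_markdown("```markdown\n# Invoice\n| Item | Qty |\n```"))
# -> # Invoice
#    | Item | Qty |
```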
MolmoAct 2 is an open robotics model that reasons in 3D space before taking actions, developed by the Allen Institute for Artificial Intelligence.
dots.ocr is a lightweight 1.7B-parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) at document parsing and OCR tasks.
SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models such as NVIDIA's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.
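A spatially-grounded VQA tuple pairs a question and answer with the page region that grounds the answer, plus provenance metadata. The record below is a minimal sketch of that structure; the field names are assumptions for illustration, not the actual SGOCR schema.

```python
from dataclasses import dataclass, field

# Illustrative record for a spatially-grounded OCR VQA tuple.
# Field names are assumed for this sketch, not SGOCR's real schema.
@dataclass
class VQATuple:
    question: str
    answer: str
    bbox: tuple           # (x0, y0, x1, y1) pixel box grounding the answer
    metadata: dict = field(default_factory=dict)

sample = VQATuple(
    question="What is the invoice total?",
    answer="$1,240.00",
    bbox=(412, 890, 520, 912),
    metadata={"source_model": "gemini-2.5-flash", "page": 3},
)
print(sample.answer, sample.bbox)
```

Carrying the bounding box alongside each answer is what makes the tuples useful for training grounding-capable VLMs, since the model can be supervised on *where* the answer appears, not just *what* it is.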
Granite 4.0 3B Vision is IBM's compact vision-language model for enterprise document understanding, with specialized capabilities for table extraction, chart interpretation via ChartNet, and key-value pair grounding.
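Key-value pair grounding means each extracted field keeps a pointer to the page region it came from, so values can be audited against the source document. The function below is a small illustrative sketch of such an output structure, with names assumed for the example; it is not Granite's actual interface.

```python
# Illustrative key-value grounding output (names assumed for this sketch):
# each extracted pair carries the bounding box of its value so the
# extraction can be traced back to the page.
def ground_pairs(pairs):
    """pairs: iterable of (key, value, bbox) -> dict keyed by field name."""
    return {key: {"value": value, "bbox": bbox} for key, value, bbox in pairs}

extracted = [
    ("invoice_number", "INV-0042", (120, 80, 260, 104)),
    ("due_date", "2025-03-31", (120, 110, 240, 134)),
]
print(ground_pairs(extracted)["invoice_number"])
# -> {'value': 'INV-0042', 'bbox': (120, 80, 260, 104)}
```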
MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.
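The coarse-to-fine idea can be sketched in a few lines: layout is detected on a cheap downscaled view of the page, then each detected box is rescaled back to the full-resolution scan so recognition runs on native-resolution crops. This is an illustrative sketch of the general strategy, not MinerU2.5's actual code.

```python
# Sketch of a coarse-to-fine document parsing step (illustrative only):
# layout boxes found on a low-resolution thumbnail are mapped back to
# full-resolution coordinates for fine-grained recognition.
def rescale_boxes(boxes, coarse_size, full_size):
    """boxes: list of (x0, y0, x1, y1) in thumbnail coordinates."""
    sx = full_size[0] / coarse_size[0]
    sy = full_size[1] / coarse_size[1]
    return [
        (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))
        for x0, y0, x1, y1 in boxes
    ]

# A layout box detected on a 1024x1024 thumbnail of a 4096x4096 page scan.
coarse_boxes = [(100, 200, 400, 260)]
print(rescale_boxes(coarse_boxes, (1024, 1024), (4096, 4096)))
# -> [(400, 800, 1600, 1040)]
```

Running layout on the thumbnail keeps the expensive global pass cheap, while the rescaled crops preserve full detail for text and formula recognition, which is where resolution matters most.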
SmolDocling is a compact 256M-parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces a universal markup format called DocTags that captures page elements together with their locations, competing with models 27 times larger.
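The core of a DocTags-style representation is that each page element is serialized with both its type and a quantized location. The toy serializer below conveys that idea; the tag names and the `<loc_*>` grid convention here are illustrative approximations, not the exact DocTags specification.

```python
# Toy serializer in the spirit of DocTags (tag names and the <loc_*>
# quantization are illustrative, not the exact spec): each element is
# emitted with its type and corners quantized to a fixed grid.
def to_doctags(elements, page_w, page_h, grid=500):
    def q(v, size):
        # quantize a pixel coordinate onto a 0..grid scale
        return min(grid, round(v * grid / size))

    parts = []
    for kind, text, (x0, y0, x1, y1) in elements:
        locs = (f"<loc_{q(x0, page_w)}><loc_{q(y0, page_h)}>"
                f"<loc_{q(x1, page_w)}><loc_{q(y1, page_h)}>")
        parts.append(f"<{kind}>{locs}{text}</{kind}>")
    return "<doctag>" + "".join(parts) + "</doctag>"

page = [("section_header", "1. Introduction", (50, 40, 450, 70))]
print(to_doctags(page, page_w=500, page_h=500))
# -> <doctag><section_header><loc_50><loc_40><loc_450><loc_70>1. Introduction</section_header></doctag>
```

Emitting structure and location in one token stream is what lets a single small model do end-to-end conversion, instead of chaining separate layout and OCR stages.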