vision-language-model

#vision-language-model

@tom_doerr: Converts images and PDFs to Markdown without OCR https://github.com/NanoNets/docext

X AI KOLs Timeline · 2d ago

docext is an on-premises toolkit that uses vision-language models to convert images and PDFs to Markdown without OCR. It also introduces Nanonets-OCR-s, a compact 3B-parameter model for efficient image-to-Markdown conversion.

MolmoAct 2

Product Hunt · 5d ago

MolmoAct 2 is an open robotics model from the Allen Institute for Artificial Intelligence that reasons in 3D space before taking actions.

@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…

X AI KOLs Timeline · 2026-04-20

dots.ocr is a new lightweight, multilingual 1.7B-parameter vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) on document parsing and OCR tasks.

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

Reddit r/MachineLearning · 2026-04-20

SGOCR is an open-source dataset pipeline for generating spatially-grounded, OCR-focused visual question answering (VQA) tuples with rich metadata to support diverse VLM training. The pipeline uses a multi-stage approach combining models like Nvidia's nemotron-ocr-v2, Gemma4, Qwen3-VL, and Gemini-2.5-Flash, along with an agentic optimization loop.

Granite 4.0 3B Vision: Compact Multimodal Intelligence for Enterprise Documents

Hugging Face Blog · 2026-03-31

IBM releases Granite 4.0 3B Vision, a compact vision-language model designed for enterprise document understanding, featuring specialized capabilities for table extraction, chart interpretation via ChartNet, and key-value pair grounding.

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

Papers with Code Trending · 2025-09-26

MinerU2.5 is a 1.2B-parameter vision-language model that achieves state-of-the-art document parsing accuracy with high computational efficiency using a coarse-to-fine parsing strategy.

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Papers with Code Trending · 2025-03-14

SmolDocling is a compact 256M-parameter vision-language model designed for end-to-end multi-modal document conversion. It introduces DocTags, a new universal markup format that captures page elements together with their locations, and it competes with models 27 times larger.
