@tom_doerr: Converts images and PDFs to Markdown without OCR https://github.com/NanoNets/docext
Summary
docext is an on-premises toolkit that converts images and PDFs to markdown without OCR, leveraging vision-language models. It also introduces Nanonets-OCR-s, a compact 3B parameter model for efficient image-to-markdown conversion.
Cached at: 05/08/26, 05:36 PM
Source: https://github.com/NanoNets/docext
docext
An on-premises document information extraction and benchmarking toolkit.

New Model Release: Nanonets-OCR-s
We’re excited to announce the release of Nanonets-OCR-s, a compact 3B-parameter model trained specifically for efficient image-to-markdown conversion, with semantic understanding of images, signatures, watermarks, and more.
📢 Read the full announcement | 🤗 Hugging Face model
Overview
docext is a comprehensive on-premises document intelligence toolkit powered by vision-language models (VLMs). It provides three core capabilities:
📄 PDF & Image to Markdown Conversion: Transform documents into structured markdown with intelligent content recognition, including LaTeX equations, signatures, watermarks, tables, and semantic tagging.
🔍 Document Information Extraction: OCR-free extraction of structured information (fields, tables, etc.) from documents such as invoices, passports, and other document types, with confidence scoring.
📊 Intelligent Document Processing Leaderboard: A comprehensive benchmarking platform that tracks and evaluates vision-language model performance across OCR, Key Information Extraction (KIE), document classification, table extraction, and other intelligent document processing tasks.
Features
PDF and Image to Markdown
Convert both PDFs and images to markdown with content recognition and semantic tagging.
- LaTeX Equation Recognition: Convert both inline and block LaTeX equations in images to markdown.
- Intelligent Image Description: Generate a detailed description for all images in the document within <img></img> tags.
- Signature Detection: Detect and mark signatures in the document. Signature text is extracted within <signature></signature> tags.
- Watermark Detection: Detect and mark watermarks in the document. Watermark text is extracted within <watermark></watermark> tags.
- Page Number Detection: Detect and mark page numbers in the document. Page numbers are extracted within <page_number></page_number> tags.
- Checkboxes and Radio Buttons: Convert form checkboxes and radio buttons into standardized Unicode symbols (☐, ☑, ☒).
- Table Detection: Convert complex tables into HTML tables.
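The semantic tags listed above make the markdown output easy to post-process. The following is a minimal sketch, not part of docext itself: it assumes the converted markdown is already available as a string, and the helper names (`extract_tagged`, `count_checkboxes`) and the sample text are invented for illustration. Only the tag names and checkbox symbols come from the feature list.

```python
import re
from collections import Counter

# Tag names and checkbox symbols are taken from the feature list above.
# Everything else here (helper names, sample text) is hypothetical.
TAGS = ("img", "signature", "watermark", "page_number")
CHECKBOXES = {"☐": "unchecked", "☑": "checked", "☒": "crossed"}


def extract_tagged(markdown: str) -> dict[str, list[str]]:
    """Return the text found inside each known tag, in document order."""
    found = {}
    for tag in TAGS:
        # Non-greedy match so adjacent tags are not merged; DOTALL lets
        # a tag body span several lines.
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        found[tag] = [body.strip() for body in pattern.findall(markdown)]
    return found


def count_checkboxes(markdown: str) -> Counter:
    """Count checkbox symbols by state."""
    return Counter(CHECKBOXES[ch] for ch in markdown if ch in CHECKBOXES)


# Invented sample that mimics the tagging conventions described above.
sample = (
    "# Lease Agreement\n"
    "<watermark>CONFIDENTIAL</watermark>\n"
    "<img>Company logo: a blue hexagon</img>\n"
    "☑ Pets allowed\n"
    "☐ Smoking allowed\n"
    "☒ Subletting allowed\n"
    "<signature>Jane Doe</signature>\n"
    "<page_number>1</page_number>\n"
)

print(extract_tagged(sample))
print(count_checkboxes(sample))
```

A regular expression is enough here because the tags are described as flat wrappers around text. If the real output nests tags or embeds them inside HTML tables, a proper HTML parser would be the safer choice.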
🔍 For in-depth information, see the release blog.
For setup instructions and additional details, check out the full feature guide for PDF to markdown conversion.
Intelligent Document Processing Leaderboard
This benchmark evaluates performance across seven key document intelligence challenges:
- Key Information Extraction (KIE): Extract structured fields from unstructured document text.
- Visual Question Answering (VQA): Assess understanding of document content via question-answering.
- Optical Character Recognition (OCR): Measure accuracy in recognizing printed and handwritten text.
- Document Classification: Evaluate how accurately models categorize various document types.
- Long Document Processing: Test models’ reasoning over lengthy, context-rich documents.
- Table Extraction: Benchmark structured data extraction from complex tabular formats.
- Confidence Score Calibration: Evaluate the reliability and confidence of model predictions.
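To make the last item concrete: a model is well calibrated when predictions made with confidence p are correct about a fraction p of the time. The source does not say which calibration metric the leaderboard uses, so the sketch below swaps in expected calibration error (ECE), a common choice, purely as an illustration. The function name and the example numbers are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    This is an illustrative metric, not necessarily the leaderboard's.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")

    # Group predictions into equal-width bins by confidence.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence/accuracy gap by its share of predictions.
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# Four extracted fields, each reported at 0.9 confidence, three of them right:
# the model claims 90% but delivers 75%, so the gap is about 0.15.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A lower value means the reported confidence tracks actual accuracy more closely. A score of 0 means the two agree in every bin.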
🔍 For in-depth information, see the release blog.
📊 Live leaderboard: https://idp-leaderboard.org
For setup instructions and additional details, check out the full feature guide for the Intelligent Document Processing Leaderboard.
Docext
- Flexible extraction: Define custom fields or use pre-built templates
- Table extraction: Extract structured tabular data from documents
- Confidence scoring: Get confidence levels for extracted information
- On-premises deployment: Run entirely on your own infrastructure (Linux, macOS)
- Multi-page support: Process documents with multiple pages
- REST API: Programmatic access for integration with your applications
- Pre-built templates: Ready-to-use templates for common document types:
  - Invoices
  - Passports
- Add/delete new fields/columns for other templates.
For more details (Installation, Usage, and so on), please check out the feature guide.
Change Log
Latest Updates
- 12-06-2025 - Added PDF and image to markdown support.
- 06-06-2025 - Added gemini-2.5-pro-preview-06-05 evaluation metrics to the leaderboard.
- 04-06-2025 - Added support for PDF and multiple documents in docext extraction.
Older Changes
- 23-05-2025 – Added gemini-2.5-pro-preview-03-25, claude-sonnet-4 evaluation metrics to the leaderboard.
- 17-05-2025 – Added InternVL3-38B-Instruct, qwen2.5-vl-32b-instruct evaluation metrics to the leaderboard.
- 16-05-2025 – Added gemma-3-27b-it evaluation metrics to the leaderboard.
- 12-05-2025 – Added Claude 3.7 sonnet, mistral-medium-3 evaluation metrics to the leaderboard.
About
docext is developed by Nanonets, a leader in document AI and intelligent document processing solutions. Nanonets is committed to advancing the field of document understanding through open-source contributions and innovative AI technologies. If you are looking for information extraction solutions for your business, please visit our website to learn more.
Contributing
We welcome contributions! Please see contribution.md for guidelines. If you have a feature request or need support for a new model, feel free to open an issue. We’d love to discuss it further!
Troubleshooting
If you encounter any issues while using docext, please refer to our Troubleshooting guide for common problems and solutions.
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.