baidu/Unlimited-OCR

Hugging Face Models Trending 2026/06/19 09:40 Models

ocr document-parsing multimodal baidu huggingface open-source

Summary

Baidu has released Unlimited-OCR, a new model for one-shot long-horizon document parsing, built on Deepseek-OCR. It supports single-image and multi-page/PDF parsing through Hugging Face Transformers and SGLang.

Task: image-text-to-text
Tags: transformers, safetensors, unlimited-ocr, image-feature-extraction, baidu, vision-language, ocr, custom_code, image-text-to-text, multilingual, license:mit, region:us

Cached: 2026/06/22 19:33

baidu/Unlimited-OCR · Hugging Face Source: https://huggingface.co/baidu/Unlimited-OCR Baidu Inc.

Unlimited OCR Works

GitHub (https://github.com/baidu/Unlimited-OCR) Hugging Face (https://huggingface.co/baidu/Unlimited-OCR) arXiv (https://huggingface.co/baidu/Unlimited-OCR/blob/main/Unlimited-OCR.pdf) Follow on Twitter (https://x.com/Baidu_Inc)

Welcome to the era of one-shot long-horizon parsing.

Unlimited OCR overview

Release

[2026/06/22] 🚀 We present Unlimited-OCR (https://github.com/baidu/Unlimited-OCR), which aims to push Deepseek-OCR (https://github.com/deepseek-ai/DeepSeek-OCR) one step further.

Inference

Transformers

Inference with Hugging Face transformers on NVIDIA GPUs. Requirements tested with Python 3.12.3 + CUDA 12.9:

torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
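
Before loading the model, it can help to confirm that the active environment matches the tested pins. The following is a minimal sketch using only the standard library; the `check_pins` helper and its `get_version` parameter are hypothetical names introduced here for illustration and are not part of the model card.

```python
from importlib import metadata

# Pins copied from the tested requirements list above.
PINNED = {
    "torch": "2.10.0",
    "torchvision": "0.25.0",
    "transformers": "4.57.1",
    "Pillow": "12.1.1",
    "matplotlib": "3.10.8",
    "einops": "0.8.2",
    "addict": "2.4.0",
    "easydict": "1.13",
    "pymupdf": "1.27.2.2",
    "psutil": "7.2.2",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found, matches)} for each pinned package."""
    report = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # the package is not installed
        # Ignore local build tags such as "+cu129" when comparing versions.
        base = found.split("+")[0] if found else None
        report[name] = (wanted, found, base == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_pins(PINNED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{status:8} {name}: wanted {wanted}, found {found}")
```

A mismatch does not necessarily mean inference will fail; the pins are only the combination reported as tested.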

import os
import torch
from transformers import AutoModel, AutoTokenizer

model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()

# ── Single image supports two configurations: gundam or base ──
# gundam: base_size=1024, image_size=640, crop_mode=True
# base: base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024,
    image_size=640,
    crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=128,
    save_results=True,
)

# ── Multi-page / PDF uses the base configuration only (image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True,
)

# ── PDF (convert pages to images, then run multi-page parsing) ──
import tempfile, fitz  # PyMuPDF

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35,
    ngram_window=1024,
    save_results=True,
)
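
The calls above pass `no_repeat_ngram_size=35` together with `ngram_window` (128 for a single image, 1024 for multi-page input). The model card does not spell out how the two interact, so the sketch below is an assumption modeled on the usual windowed no-repeat n-gram constraint: a token is banned if it would complete an n-gram that already occurs within the most recent `window` tokens. The function name `banned_next_tokens` and the `whitelist` parameter are hypothetical; the model's own remote code may behave differently.

```python
def banned_next_tokens(tokens, ngram_size, window, whitelist=frozenset()):
    """Token ids that would repeat an n-gram already seen in the recent window.

    Assumed semantics (not confirmed by the model card): only n-grams that
    start within the last `window` tokens are considered.
    """
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    start = max(0, len(tokens) - window)
    banned = set()
    for i in range(start, len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # emitting this token would repeat the n-gram
    # Whitelisted ids (for example, structural tokens) are never banned.
    return banned - set(whitelist)


# Toy example with a trigram constraint: the sequence ends in (1, 2),
# and (1, 2, 3) already occurred, so token 3 is banned next.
print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3, window=128))
```

Under this reading, a larger window catches repetition loops that span more text, which would be consistent with the larger value used for multi-page input.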

SGLang

Set up the environment (a uv-managed virtual environment). Install the local SGLang wheel first, then pin kernels and install PyMuPDF for PDF-to-image conversion, using the versions given in the commands below:

uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2

Launch the SGLang server:

python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --page-size 1 \
    --mem-fraction-static 0.8 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --disable-overlap-schedule \
    --skip-server-warmup \
    --host 0.0.0.0 \
    --port 10000
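
Loading the model can take a while, so a client script may want to wait until the server accepts connections before sending requests. The sketch below polls the TCP port with the standard library only; `wait_for_port` is a hypothetical helper, and the host and port are taken from the launch command above. An open port only shows that the process is accepting connections, not that the model is ready to serve.

```python
import socket
import time


def wait_for_port(host, port, timeout=600.0, interval=2.0):
    """Return True once a TCP connection succeeds, False if `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Short timeout so the example returns quickly when no server is running.
    ready = wait_for_port("127.0.0.1", 10000, timeout=5.0, interval=1.0)
    print("server reachable" if ready else "server not reachable yet")
```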

Send a streaming request to the OpenAI-compatible API:

import base64
import json
import os
import tempfile
import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor

server_url = "http://127.0.0.1:10000"
session = requests.Session()
session.trust_env = False

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths

def encode_image(image_path):
    ext = os.path.splitext(image_path)[1].lower()
    mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}

def build_content(prompt, image_paths):
    return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]

def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": ngram_window,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1200,
        stream=True,
    )
    response.raise_for_status()
    chunks = []
    for line in response.iter_lines(chunk_size=1, decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            chunks.append(delta)
    print()
    return "".join(chunks)

# Single image supports two configurations: gundam or base. The example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)

# Multiple images (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)

# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)
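
The streaming loop in `generate` mixes network I/O with parsing, which makes the parsing hard to check in isolation. As a sketch, the same logic can be factored into pure functions and exercised without a running server. The names `parse_sse_line` and `collect_text` are hypothetical, and the sample lines are made up for illustration; they only mimic the shape of the streamed chunks that the code above expects.

```python
import json


def parse_sse_line(line):
    """Return the text delta of one streamed line.

    Returns "" for lines that carry no text and None at the end of the stream.
    """
    if not line or not line.startswith("data: "):
        return ""
    data = line[len("data: "):]
    if data == "[DONE]":
        return None
    event = json.loads(data)
    # `content` may be missing or null in chunks that only carry metadata.
    return event["choices"][0].get("delta", {}).get("content") or ""


def collect_text(lines):
    """Concatenate the text deltas of a stream up to the [DONE] marker."""
    chunks = []
    for line in lines:
        delta = parse_sse_line(line)
        if delta is None:
            break
        chunks.append(delta)
    return "".join(chunks)


# Made-up sample stream, for illustration only.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant", "content": null}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
print(collect_text(sample))
```

With the parsing separated out, `generate` could pass `response.iter_lines(decode_unicode=True)` to `collect_text`, at the cost of losing the incremental printing.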

Visualization

Long-horizon OCR demo

Acknowledgement

We thank Deepseek-OCR (https://github.com/deepseek-ai/DeepSeek-OCR), Deepseek-OCR-2 (https://github.com/deepseek-ai/DeepSeek-OCR-2), and PaddleOCR (https://github.com/PaddlePaddle/PaddleOCR) for their valuable models and ideas.

Citation

Coming soon!

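Similar articles

Unlimited OCR: One-shot long-horizon parsing

Hacker News Top

Baidu has released Unlimited-OCR, an open-source model built on Deepseek-OCR for one-shot long-horizon document parsing, supporting single images, multi-page documents, and PDFs.

@geekbb: Baidu's open-source vision-language OCR project, an upgrade built on DeepSeek-OCR that focuses on parsing very long documents in one shot. The model has two inference modes: gundam mode handles dense text in a single image, and base mode handles multi-page input or PDFs. https://github…

X AI KOLs Timeline

Baidu has open-sourced the vision-language model Unlimited-OCR, an upgrade of DeepSeek-OCR that parses very long documents in one shot and offers two inference modes: gundam (dense text in a single image) and base (multi-page/PDF).

@vanstriendaniel: Another OCR model! Baidu's Unlimited-OCR is one of the more interesting ones. You can, without too much…

X AI KOLs Following

This post shows how to serve Baidu's Unlimited-OCR model on Hugging Face Jobs as a temporary, OpenAI-compatible endpoint, with multi-page document parsing and features such as table-to-HTML and formula-to-LaTeX extraction.

@ErickSky: Baidu has just broken one of the biggest limitations of current OCR. Unlimited-OCR processes an entire document in one shot…

X AI KOLs Timeline

Baidu has released Unlimited-OCR, which processes an entire document in one pass without chunking, overcoming a major limitation of current OCR technology.

@GoSailGlobal: Today's OCR handles a document of dozens of pages one page at a time, and its memory is wiped with every page turn. Today Baidu quietly open-sourced a model on GitHub and HuggingFace called Unlimited OCR. Its inspiration comes from how people copy a book by hand: - When you copy a book, you don't, after writing each character, take what came before…

X AI KOLs Timeline

Baidu has open-sourced the Unlimited OCR model, which uses a Reference Sliding Window Attention (R-SWA) mechanism to parse documents in one shot over a context of up to 32K, without page-by-page inference.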
