@GoSailGlobal: 现在的 OCR 处理几十页文档,是逐页跑的 每翻一页,记忆清零一次 百度今天悄悄在 GitHub 和 HuggingFace 开源了一个模型,叫 Unlimited OCR,它的灵感,来自人类抄书: - 你抄一本书,不会每写一个字就把前面…
摘要
百度开源了 Unlimited OCR 模型,采用参考滑动窗口注意力(R-SWA)机制,实现了对长达 32K 上下文的一次性文档解析,无需逐页推理。
查看缓存全文
缓存时间: 2026/06/22 09:41
现在的 OCR 处理几十页文档,是逐页跑的 每翻一页,记忆清零一次
百度今天悄悄在 GitHub 和 HuggingFace 开源了一个模型,叫 Unlimited OCR,它的灵感,来自人类抄书:
- 你抄一本书,不会每写一个字就把前面几百页重读一遍
- 你只瞄一眼附近写到哪了,剩下的,做的是「软遗忘」
- 人就是靠这种低负荷的连续认知状态,扛住了长程任务
百度把这个直觉,做成了一个注意力机制 R-SWA(参考滑动窗口注意力): 每个 token 看得到整张图,但输出端只回看最近 128 个 token KV Cache 全程恒定,不随页数膨胀
结果是:在 32K 上下文下,一次前向推理,转录几十页文档,从逐页 for-loop,变成一口气抄完
逐页 for-loop 是工程权宜 连续的认知状态,才更像智能该有的样子
百度最近的路子,确实不太一样了
http://github.com/baidu/Unlimited-OCR… https://huggingface.co/baidu/Unlimited-OCR…
baidu/Unlimited-OCR
Source: https://github.com/baidu/Unlimited-OCR
Unlimited OCR Works
Welcome the Era of One-shot Long-horizon Parsing.
Release
- [2026/06/22] 🚀 We present Unlimited-OCR, aiming to push Deepseek-OCR one step further.
Inference
Transformers
Inference using Huggingface transformers on NVIDIA GPUs. Requirements tested on python 3.12.3 + CUDA12.9:
torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
import os
import torch
from transformers import AutoModel, AutoTokenizer
model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)
# ── Single image supports two configs: gundam or base ──
# gundam: base_size=1024, image_size=640, crop_mode=True
# base: base_size=1024, image_size=1024, crop_mode=False
model.infer(
tokenizer,
prompt='<image>document parsing.',
image_file='your_image.jpg',
output_path='your/output/dir',
base_size=1024, image_size=640, crop_mode=True,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=128,
save_results=True,
)
# ── Multi page / PDF only uses base (image_size=1024) ──
model.infer_multi(
tokenizer,
prompt='<image>Multi page parsing.',
image_files=['page1.png', 'page2.png', 'page3.png'],
output_path='your/output/dir',
image_size=1024,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=1024,
save_results=True,
)
# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile, fitz # PyMuPDF
def pdf_to_images(pdf_path, dpi=300):
doc = fitz.open(pdf_path)
tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
mat = fitz.Matrix(dpi / 72, dpi / 72)
paths = []
for i, page in enumerate(doc):
out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
page.get_pixmap(matrix=mat).save(out)
paths.append(out)
doc.close()
return paths
model.infer_multi(
tokenizer,
prompt='<image>Multi page parsing.',
image_files=pdf_to_images('your_doc.pdf', dpi=300),
output_path='your/output/dir',
image_size=1024,
max_length=32768,
no_repeat_ngram_size=35, ngram_window=1024,
save_results=True,
)
SGLang
Set up the environment (uv-managed virtualenv). Install the local SGLang wheel first,
then pin kernels==0.9.0 and install PyMuPDF for PDF-to-image conversion:
uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2
Start the SGLang server:
python -m sglang.launch_server \
--model baidu/Unlimited-OCR \
--served-model-name Unlimited-OCR \
--attention-backend fa3 \
--page-size 1 \
--mem-fraction-static 0.8 \
--context-length 32768 \
--enable-custom-logit-processor \
--disable-overlap-schedule \
--skip-server-warmup \
--host 0.0.0.0 \
--port 10000
Send streaming requests to the OpenAI-compatible API:
import base64
import json
import os
import tempfile
import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor
server_url = "http://127.0.0.1:10000"
session = requests.Session()
session.trust_env = False
def pdf_to_images(pdf_path, dpi=300):
doc = fitz.open(pdf_path)
tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
mat = fitz.Matrix(dpi / 72, dpi / 72)
image_paths = []
for i, page in enumerate(doc):
image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
page.get_pixmap(matrix=mat).save(image_path)
image_paths.append(image_path)
doc.close()
return image_paths
def encode_image(image_path):
ext = os.path.splitext(image_path)[1].lower()
mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
with open(image_path, "rb") as f:
data = base64.b64encode(f.read()).decode("utf-8")
return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}
def build_content(prompt, image_paths):
return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]
def generate(prompt, image_paths, image_mode, ngram_window):
payload = {
"model": "Unlimited-OCR",
"messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
"temperature": 0,
"skip_special_tokens": False,
"images_config": {"image_mode": image_mode},
"custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
"custom_params": {
"ngram_size": 35,
"window_size": ngram_window,
},
"stream": True,
}
response = session.post(
f"{server_url}/v1/chat/completions",
headers={"Content-Type": "application/json"},
data=json.dumps(payload),
timeout=1200,
stream=True,
)
response.raise_for_status()
chunks = []
for line in response.iter_lines(chunk_size=1, decode_unicode=True):
if not line or not line.startswith("data: "):
continue
data = line[len("data: "):]
if data == "[DONE]":
break
event = json.loads(data)
delta = event["choices"][0].get("delta", {}).get("content", "")
if delta:
print(delta, end="", flush=True)
chunks.append(delta)
print()
return "".join(chunks)
# Single image supports two configs: gundam or base. Example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)
# Multi image (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)
# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)
For batch inference, infer.py starts the SGLang server automatically and sends concurrent requests for an image directory or PDF:
# Image directory
python infer.py \
--image_dir ./examples/images \
--output_dir ./outputs \
--concurrency 8 \
--image_mode gundam
# PDF pages
python infer.py \
--pdf ./examples/document.pdf \
--output_dir ./outputs \
--concurrency 8 \
--image_mode gundam
Useful options:
--model_dir baidu/Unlimited-OCR # Local path or Hugging Face model ID
--gpu 0 # CUDA_VISIBLE_DEVICES value
--server_log ./log/sglang_server.log
Visualization
Acknowledgement
We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas.
Citation
Coming soon!
相似文章
@geekbb: 百度开源的视觉语言模型 OCR 项目,在 DeepSeek-OCR 基础上做了升级,主打一次性解析超长文档。模型有两种推理模式:gundam 模式用来对付单张图里的密集文字,base 模式处理多页或 PDF。 https://github…
百度开源了视觉语言模型Unlimited-OCR,基于DeepSeek-OCR升级,支持一次性解析超长文档,提供gundam(单图密集文字)和base(多页/PDF)两种推理模式。
@berryxia: 卧槽,这一波直接把DeepSeek的“墙角挖到了啊”! 昨晚看到HuggingFace刷到这个有意思的OCR开源模型和原来背后有趣的故事。 这个OCR模型直接与传统的OCR模型完全不同! 光着速度和精准度真的就无敌了~~ 先说说背景,熟悉…
百度开源了Unlimited OCR模型,采用R-SWA注意力机制,可一次性处理数百页文档,无需分页,KV Cache恒定。该模型创新性地借鉴了人类抄书时的注意力模式,并与DeepSeek OCR有技术渊源,引发了对人才流动的关注。
baidu/Unlimited-OCR
百度发布了Unlimited-OCR,一种用于一次性长程文档解析的新模型,基于Deepseek-OCR构建。它支持通过Hugging Face Transformers和SGLang进行单图像和多页/PDF解析。
Unlimited OCR: 一次性长程解析
百度发布Unlimited-OCR,这是一个基于Deepseek-OCR构建的开源模型,用于一次性长程文档解析,支持单张图片、多页文档和PDF。
@vanstriendaniel: OCR模型又来了!百度公司的Unlimited-OCR是其中比较有趣的一个。你可以无需太多…
这篇文章展示了如何在Hugging Face Jobs上将百度的Unlimited-OCR模型作为临时的、兼容OpenAI的端点提供服务,支持多页文档解析,具有表格转HTML和公式转LaTeX提取等功能。