numind/NuExtract3

Hugging Face Models Trending 2026/04/29 07:46 Model

document-understanding structured-extraction vision-language ocr multimodal reasoning open-source

Summary

NuExtract3 is a 4B-parameter vision-language reasoning model for document understanding. It supports structured extraction and image-to-Markdown conversion.

Task: image-to-text. Tags: transformers, safetensors, qwen3_5, image-text-to-text, vision-language, vlm, document-understanding, structured-extraction, information-extraction, ocr, document-to-markdown, markdown, rag, reasoning, multilingual, conversational, image-to-text, base model: Qwen/Qwen3.5-4B, base model: finetune:Qwen/Qwen3.5-4B, license: apache-2.0, endpoints_compatible, region: us

Cached: 2026/05/22 07:44

numind/NuExtract3 · Hugging Face

Source: https://huggingface.co/numind/NuExtract3

🖥️ API / Platform (https://nuextract.ai/) | 📑 Blog (https://numind.ai/blog) | 🗣️ Discord (https://discord.gg/3tsEtJNCDe) | 🛠️ GitHub (https://github.com/numindai/nuextract)

NuExtract3 is a unified 4B vision-language reasoning model for document understanding. It combines strong structured information extraction with high-quality image-to-Markdown conversion, and is suited to extraction pipelines, OCR, and RAG preprocessing across documents such as scans, receipts, forms, invoices, contracts, and tables.

Try it in the 🤗 Space! (https://huggingface.co/spaces/numind/NuExtract-3-4B)

## Overview

- Structured extraction: input (text/image) + JSON template + instructions -> JSON output
- Markdown conversion: input (text/image) -> Markdown
- Multimodal input: text, image, or text + image.
- Multilingual documents.
- Reasoning and non-reasoning inference modes.
- Template generation for structured extraction, from natural language or from an input document.

## Benchmark results

## Structured extraction

We evaluated NuExtract on NuMind's internal structured-extraction benchmark, which measures model performance on about 600 documents of varied types (including invoices, movie posters, and floor plans). The documents and their ground-truth annotations cover a wide range of use cases and test a model's visual understanding, OCR, reasoning, and ability to handle long input and output contexts. We plan to open-source the benchmark in the coming weeks, together with a broad leaderboard covering most popular open-weight models and closed-source APIs, and a Python library that makes it easy to measure model performance on structured extraction.

To score a pair of predicted and ground-truth JSON objects, we represent both as trees, align them by node name, compute a metric score for each aligned leaf, and report the mean of those scores. `string` and `verbatim-string` leaves are scored with the insertion-deletion distance (that is, Levenshtein distance without substitutions); all other leaves are scored by exact match.

Models were evaluated with vLLM at temperature 0.25 and a maximum of 65,000 output tokens (thinking plus answer), well above the 22,000 tokens of the largest ground-truth output.

| Model name | Mean score | Failures (1) | Mean thinking tokens | Mean answer tokens |
| --- | --- | --- | --- | --- |
| NuExtract3.4_4B-RL | 0.651 ± 0.019 | 27 | 2036 | 1856 |
| gemma-4-E4B-it | 0.538 ± 0.023 | 31 | 3005 | 1287 |
| Qwen3.5-9B | 0.479 ± 0.030 | 170 | 2240 | 91257 |
| Qwen3.5-4B | 0.417 ± 0.031 | 229 | 2717 | 71201 |
| GLM-4.6V-Flash | 0.435 ± 0.026 | 153 | 2989 | 1357 |
| Nemotron-3-Nano-Omni | 0.387 ± 0.028 | 204 | 2582 | 7552 |
| Ministral-3-3B | 0.240 ± 0.022 | 344 | 2758 | 6362 |

(1) Number of model outputs that could not be JSON-deserialized, either directly or after removing leading and trailing backticks.

95% confidence intervals are computed with a non-parametric bootstrap over the score distribution.

The benchmark includes samples with multiple images, which produce large input contexts, and samples whose ground truth contains many items to extract, which produce large outputs. We found that reasoning significantly degrades the performance of small models: many of them get stuck in repetition loops, hit the output token limit, and the request fails.

## Document to Markdown

NuExtract can also convert document images into clean Markdown. Text (headings, etc.) is output as Markdown, tables as HTML, and math formulas as LaTeX.

Modern, format-agnostic benchmarks for complex document understanding are scarce, so we explored a new evaluation approach. We selected 100 documents with challenging layouts and tables, asked each model to convert them into a structured representation, then used Gemini 3 Flash to compare the model outputs against the source document and pick the most accurate one. The resulting rankings agree with human votes, which suggests this is a promising way to evaluate document-to-Markdown ability. More details will be shared in an upcoming technical report. Here are some results: (results figure not included in this cached copy)

### Using "Markdown-to-structured"

As an additional point of reference, we used the structured-extraction benchmark to evaluate models in two steps: first convert the benchmark inputs to Markdown, then run the structured-extraction task on that Markdown with Qwen3.6 27B. Intuitively, this measures how well a model preserves the content and layout of the input document: a good model lets the "structured extractor" model reach a higher score.

## Using NuExtract

## Structured extraction

The inputs to structured extraction are:

1. The input document, which can be text, an image, or both;
2. A JSON template describing the information to extract;
3. (Optional) Instructions that specify the expected output format or values, passed through the `instructions` chat-template keyword argument;
4. (Optional) In-context learning (ICL) examples.

### Input JSON template

NuExtract takes an input JSON template with the same structure as the output JSON. Its leaf values specify the types of the output JSON leaves. For example:

```json
{
  "invoice_number": "verbatim-string",
  "invoice_date": "date",
  "total_amount": "number",
  "currency": "currency",
  "line_items": [
    {
      "description": "verbatim-string",
      "item_type": ["electronics", "clothing", "vehicle", "furniture", "other"],
      "quantity": "integer",
      "unit_price": "number",
      "total": "number"
    }
  ]
}
```

Supported template types include:

- `verbatim-string`: extracts the exact text as it appears in the document;
- `string`: a general string field that allows abstraction or light rephrasing;
- `integer`: an integer;
- `number`: an integer or a decimal;
- `date-time`: an ISO-8601 date, time, or datetime;
- Other specific types such as `date`, `time`, `country`, `currency`, `email`, and so on.

For more details, read the full type specification and examples (https://huggingface.co/numind/NuExtract3/blob/main/TYPES.md).

Template constructors:

- Arrays, e.g. `["string"]`;
- Enums, e.g. `["yes", "no", "maybe"]`;
- Multi-enums (several possible values), e.g. `[["A", "B", "C"]]`.

If the model finds no relevant information for a field, it returns `null` or `[]`.

### Converting JSON schema / Pydantic models to NuExtract template

Our Python SDK (`pip install numind`) provides a way to convert a JSON schema into a NuExtract template:

```python
from typing import Literal

from pydantic import BaseModel, Field
from numind.nuextract_utils import convert_json_schema_to_nuextract_template


class HotelBooking(BaseModel):
    city: str
    # The field description carries the NuExtract type ("date").
    check_in_date: str = Field(description="date")
    check_out_date: str = Field(description="date")
    number_of_guests: int
    room_type: Literal["single", "double", "suite"]


template, dropped_branches = convert_json_schema_to_nuextract_template(
    HotelBooking.model_json_schema()
)
# {'check_in_date': 'date', 'check_out_date': 'date', 'city': 'string', 'number_of_guests': 'integer', 'room_type': ['single', 'double', 'suite']}
```

## Document to Markdown

NuExtract can also convert document images into clean Markdown. Text (headings, etc.) is output as Markdown, tables as HTML, and math formulas as LaTeX.

Markdown example:

`# COMMANDE NUMÉRO 72259 1 Vendu à TREMBLAY ERIC ERIC TREMBLAY 348 BOUL. DE L'ANSE ROBERVAL G8H 1Y9 Livré à TREMBLAY ERIC ERIC TREMBLAY 348 BOUL. DE L'ANSE ROBERVAL G8H 1Y9 # CLIENT EXPÉDITEUR TERME DE CRÉDIT DATE 2753133 Notre camion à la livraison 22/06/2023 NOM DU VENDEUR VOTRE ÉCONOMIE ! # COMMANDE Éric 0.00`

---

## Reasoning and non-reasoning modes

NuExtract supports both reasoning and non-reasoning inference.

### Non-thinking mode

Use for fast, deterministic extraction or Markdown conversion.

```python
enable_thinking = False
temperature = 0.2
```

### Thinking mode

Use for difficult documents, complex layouts, ambiguous fields, or cases where the document structure needs extra reasoning.

```python
enable_thinking = True
temperature = 0.6
```

For production extraction workloads, we recommend starting with non-reasoning mode and enabling reasoning only on difficult examples.

---

## vLLM deployment

NuExtract can be served through vLLM with an OpenAI-compatible API.

```shell
vllm serve numind/NuExtract3 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 99, "video": 0}' \
  --chat-template-content-format openai \
  --generation-config vllm \
  --max-model-len 131072 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```

### Multi-token prediction

The deployment command above enables multi-token prediction (MTP) through vLLM speculative decoding:

```shell
--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```

MTP can improve decoding throughput without changing the OpenAI-compatible request payload. You can tune `num_speculative_tokens` for your hardware and workload. If your vLLM version or environment does not support this speculative decoding method, remove `--speculative-config`.

If you run into memory issues, reduce the maximum model length and the maximum number of images:

```shell
vllm serve numind/NuExtract3 \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 6, "video": 0}' \
  --chat-template-content-format openai \
  --generation-config vllm \
  --max-model-len 16384 \
  --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
```

## vLLM inference: structured extraction: text

```python
import json

from openai import OpenAI

# Point the OpenAI client at the local vLLM server.
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

template = {
    "store": "verbatim-string",
    "date": "date-time",
    "total": "number",
    "currency": ["USD", "EUR", "GBP", "JPY", "Other"],
    "items": [
        {
            "name": "verbatim-string",
            "price": "number"
        }
    ]
}

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0.2,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Yesterday I bought apples and coffee at Trader Joe's for a total of $12.40."
                }
            ],
        }
    ],
    # The template, instructions, and mode are passed as chat-template kwargs.
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(template),
            "instructions": "Specify the time for the `date` entry only if it is present, otherwise only output the date component.",
            "enable_thinking": False
        }
    }
)

print(response.choices[0].message.content)
```

Example output:

```json
{
  "store": "Trader Joe's",
  "date": null,
  "total": 12.40,
  "currency": "USD",
  "items": [
    { "name": "apples", "price": null },
    { "name": "coffee", "price": null }
  ]
}
```

---

## vLLM inference: structured extraction: image

```python
import base64
import json

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)


def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


# Images are sent inline as a base64 data URL.
image_base64 = encode_image("receipt.png")
data_url = f"data:image/png;base64,{image_base64}"

template = {
    "store": "verbatim-string",
    "date": "date-time",
    "total": "number",
    "payment_method": "verbatim-string"
}

response = client.chat.completions.create(
    model="numind/NuExtract3",
    temperature=0.2,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": data_url}
                }
            ],
        }
    ],
    extra_body={
        "chat_template_kwargs": {
            "template": json.dumps(template, indent=4),
            "enable_thinking": False
        }
    }
)

print(response.choices[0].message.content)
```
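The scoring procedure described in the structured-extraction benchmark section (trees aligned by node name, insertion-deletion distance on string leaves, exact match elsewhere, mean over leaves) can be sketched roughly as follows. This is not NuMind's benchmark library, which has not been released; several details are assumptions made for illustration: list items are aligned by position, the distance is normalized to a similarity as `1 - distance / (len(a) + len(b))`, leaves present on only one side score 0, and the hypothetical `fuzzy_keys` parameter names the fields typed `string` or `verbatim-string`.

```python
from typing import Any


def indel_distance(a: str, b: str) -> int:
    """Insertion-deletion distance: Levenshtein distance with no substitutions."""
    # Equals len(a) + len(b) - 2 * LCS(a, b); the LCS length uses a rolling DP row.
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return len(a) + len(b) - 2 * prev[-1]


def string_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    if not a and not b:
        return 1.0
    return 1.0 - indel_distance(a, b) / (len(a) + len(b))


def leaves(node: Any, path: tuple = ()) -> dict:
    """Flatten a JSON value into {path: leaf}; keys and list indices act as node names."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)  # assumption: list items are aligned by position
    else:
        return {path: node}
    flat = {}
    for name, child in items:
        flat.update(leaves(child, path + (name,)))
    return flat


def score(pred: Any, truth: Any, fuzzy_keys: frozenset = frozenset()) -> float:
    """Mean leaf score over the union of leaf paths in prediction and ground truth."""
    p, t = leaves(pred), leaves(truth)
    paths = set(p) | set(t)
    if not paths:
        return 1.0
    total = 0.0
    for path in paths:
        if path not in p or path not in t:
            continue  # assumption: an unaligned leaf contributes a score of 0
        a, b = p[path], t[path]
        # The field name is the last string component of the path.
        key = next((n for n in reversed(path) if isinstance(n, str)), None)
        if key in fuzzy_keys and isinstance(a, str) and isinstance(b, str):
            total += string_similarity(a, b)
        else:
            total += float(a == b)
    return total / len(paths)
```

For instance, `score(prediction, ground_truth, fuzzy_keys={"store", "name"})` would score the `store` and `name` fields by string similarity and every other leaf by exact match.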

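The recommendation to start in non-reasoning mode and enable reasoning only on difficult examples could be wired up roughly as below. The two parameter sets come from the mode settings above. Everything else is an assumption for illustration: `call_model` is a hypothetical callable wrapping your own `client.chat.completions.create` request, and "difficult" is approximated here as "the non-thinking output is not valid JSON", using the same failure definition as the benchmark footnote (direct parse, or parse after removing surrounding backticks).

```python
import json
from typing import Any, Callable, Tuple

# Mode settings as recommended in the model card.
NON_THINKING = {"enable_thinking": False, "temperature": 0.2}
THINKING = {"enable_thinking": True, "temperature": 0.6}


def parse_json_output(text: str) -> Tuple[bool, Any]:
    """Parse model output as JSON, directly or after stripping surrounding backticks."""
    stripped = text.strip().strip("`").strip()
    # Assumption: a fenced output may carry a "json" language tag after the backticks.
    if stripped.startswith("json"):
        stripped = stripped[len("json"):]
    for candidate in (text, stripped):
        try:
            return True, json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return False, None


def extract_with_fallback(call_model: Callable[[bool, float], str]) -> Any:
    """Try non-thinking mode first; retry in thinking mode if parsing fails."""
    for mode in (NON_THINKING, THINKING):
        raw = call_model(mode["enable_thinking"], mode["temperature"])
        ok, value = parse_json_output(raw)
        if ok:
            return value
    raise ValueError("model output could not be parsed as JSON in either mode")
```

In practice `call_model(enable_thinking, temperature)` would send the request shown in the vLLM inference examples, set `temperature` and `chat_template_kwargs["enable_thinking"]` accordingly, and return `response.choices[0].message.content`. Other difficulty signals, such as too many `null` fields, may suit a given workload better than parse failure.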