baidu/ERNIE-Image

Hugging Face Models Trending 2026/04/07 07:26 Model

text-to-image diffusion-transformer open-source baidu ernie generative-ai

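Summary

Baidu has released ERNIE-Image, an open-weight text-to-image generation model with 8B parameters built on a diffusion Transformer architecture. It reaches state-of-the-art performance among open-source models and is particularly strong at text rendering, instruction following, and structured image generation.

Task: text-to-image
Tags: diffusers, safetensors, text-to-image, 8B, license:apache-2.0, diffusers:ErnieImagePipeline, region:us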

Cache time: 2026/04/20 14:45

baidu/ERNIE-Image · Hugging Face

Source: https://huggingface.co/baidu/ERNIE-Image

🤗 ERNIE-Image (https://huggingface.co/Baidu/ERNIE-Image) | 🤗 ERNIE-Image-Turbo (https://huggingface.co/Baidu/ERNIE-Image-Turbo) | 🤖 ERNIE-Image (https://www.modelscope.cn/models/PaddlePaddle/ERNIE-Image/summary) | 🤖 ERNIE-Image-Turbo (https://www.modelscope.cn/models/PaddlePaddle/ERNIE-Image-Turbo/summary)
🖥️ Huggingface Demo1 (https://huggingface.co/spaces/baidu/ERNIE-Image-Turbo) | 🖥️ Huggingface Demo2 (ZeroGPU) (https://huggingface.co/spaces/akhaliq/ERNIE-Image-Turbo) | 🖥️ AI Studio Demo (https://aistudio.baidu.com/ernieimage)
Github (https://github.com/baidu/ernie-image) | 📖 Blog (https://yiyan.baidu.com/blog/posts/ernie-image) | 🖼️ Art Gallery (https://ernieimageprompt.com/)
💬 WeChat (微信) (https://github.com/baidu/ERNIE-Image/blob/main/assets/contacts/WeChat_small.jpg) | 🫨 Discord (https://discord.gg/ByUTbjfG5k) | 🏷️ X (https://x.com/ErnieforDevs)

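ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight Prompt Enhancer (PE) that expands brief user inputs into richer, structured descriptions. With only 8B DiT parameters, it achieves state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality but also for controllability in practical generation scenarios, where realizing the requested content accurately matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, which makes it well suited to commercial posters, comics, multi-panel layouts, and other content-creation tasks that need both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.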

[Image: ERNIE-Image mosaic]

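Highlights:

Compact but strong: Despite being only 8B in scale, ERNIE-Image remains highly competitive with open-source models that have far more parameters across several benchmarks.
Text rendering: ERNIE-Image performs particularly well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
Instruction following: The model reliably follows complex prompts involving multiple objects, intricate relationships, and knowledge-intensive descriptions.
Structured generation: ERNIE-Image is especially effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization matter.
Style coverage: Beyond clean, legible design-oriented outputs, the model also supports realistic photography and distinctive aesthetic styles, including softer and more cinematic visual tones.
Practical deployment: Thanks to its compact size, ERNIE-Image can run on consumer GPUs with 24 GB of VRAM, lowering the barrier for research, downstream use, and model adaptation.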
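
As a rough check on the 24 GB figure, the sketch below estimates the memory taken by the DiT weights alone. It assumes 2 bytes per parameter (bfloat16, the dtype used in the Diffusers example on this page) and ignores the text encoder, the Prompt Enhancer, the VAE, and activations, none of which are sized in the source. `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights only, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30


DIT_PARAMS = 8e9        # 8B DiT parameters, as stated in the model description
BYTES_BF16 = 2          # assumption: weights are held in bfloat16

dit_gib = weight_memory_gib(DIT_PARAMS, BYTES_BF16)
print(f"DiT weights in bfloat16: {dit_gib:.1f} GiB")
```

Under these assumptions the DiT weights come to roughly 15 GiB, which leaves some headroom on a 24 GB card for the remaining components; the actual peak usage is not reported in the source.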

Released Versions

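ERNIE-Image (https://huggingface.co/Baidu/ERNIE-Image): our SFT model, which delivers stronger general-purpose capability and instruction fidelity, typically with 50 inference steps.

ERNIE-Image-Turbo (https://huggingface.co/Baidu/ERNIE-Image-Turbo): our Turbo model, optimized with DMD and RL, which achieves faster generation and higher aesthetic quality with only 8 inference steps.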
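
The two variants differ mainly in how many denoising steps they expect. A minimal sketch of keeping that pairing in one place is shown below; the `Variant` class and the `"sft"` / `"turbo"` keys are hypothetical names for illustration, while the repository IDs and step counts are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    repo_id: str               # Hugging Face repository ID
    num_inference_steps: int   # step count stated for this variant


# Hypothetical lookup table; keys are illustrative, values come from the list above.
VARIANTS = {
    "sft": Variant("Baidu/ERNIE-Image", 50),
    "turbo": Variant("Baidu/ERNIE-Image-Turbo", 8),
}

# Ratio of denoising steps between the two variants.
step_ratio = VARIANTS["sft"].num_inference_steps / VARIANTS["turbo"].num_inference_steps
print(f"SFT uses {step_ratio:.2f}x as many steps as Turbo")
```

The step ratio is not a measured wall-clock speedup; the source gives no timing figures.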

Benchmark

GENEval

Model	Single Object	Two Objects	Counting	Colors	Position	Attribute Binding	Overall
ERNIE-Image (w/o PE)	1.0000	0.9596	0.7781	0.9282	0.8550	0.7925	0.8856
ERNIE-Image (w/ PE)	0.9906	0.9596	0.8187	0.8830	0.8625	0.7225	0.8728
Qwen-Image	0.9900	0.9200	0.8900	0.8800	0.7600	0.7700	0.8683
ERNIE-Image-Turbo (w/o PE)	1.0000	0.9621	0.7906	0.9202	0.7975	0.7300	0.8667
ERNIE-Image-Turbo (w/ PE)	0.9938	0.9419	0.8375	0.8351	0.7950	0.7025	0.8510
FLUX.2-klein-9B	0.9313	0.9571	0.8281	0.9149	0.7175	0.7400	0.8481
Z-Image	1.0000	0.9400	0.7800	0.9300	0.6200	0.7700	0.8400
Z-Image-Turbo	1.0000	0.9500	0.7700	0.8900	0.6500	0.6800	0.8233
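
The source does not say how the Overall column is computed. The sketch below checks one plausible reading, that it is the unweighted mean of the six sub-scores, against three rows copied from the table. This is an assumption tested on the published numbers, not a statement of the benchmark's official definition.

```python
from statistics import mean

# (six sub-scores in table order, reported Overall), copied from the table above
ROWS = {
    "ERNIE-Image (w/o PE)": ([1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925], 0.8856),
    "ERNIE-Image (w/ PE)": ([0.9906, 0.9596, 0.8187, 0.8830, 0.8625, 0.7225], 0.8728),
    "Qwen-Image": ([0.9900, 0.9200, 0.8900, 0.8800, 0.7600, 0.7700], 0.8683),
}

for model, (sub_scores, reported) in ROWS.items():
    recomputed = mean(sub_scores)
    # Allow for rounding in the published four-decimal values.
    matches = abs(recomputed - reported) < 1e-4
    print(f"{model:<22} recomputed={recomputed:.4f} reported={reported:.4f} match={matches}")
```

For these rows the recomputed means agree with the reported Overall values to within rounding.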

OneIG-EN

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8880	0.9440	0.3340	0.4810	0.2450	0.5780
Seedream 4.5	0.8910	0.9980	0.3500	0.4340	0.2070	0.5760
ERNIE-Image (w/ PE)	0.8678	0.9788	0.3566	0.4309	0.2411	0.5750
Seedream 4.0	0.8920	0.9830	0.3470	0.4530	0.1910	0.5730
ERNIE-Image-Turbo (w/ PE)	0.8676	0.9666	0.3537	0.4191	0.2212	0.5656
ERNIE-Image (w/o PE)	0.8909	0.9668	0.2950	0.4471	0.1687	0.5537
Z-Image	0.8810	0.9870	0.2800	0.3870	0.1940	0.5460
Qwen-Image	0.8820	0.8910	0.3060	0.4180	0.1970	0.5390
ERNIE-Image-Turbo (w/o PE)	0.8795	0.9488	0.2913	0.4277	0.1232	0.5341
FLUX.2-klein-9B	0.8871	0.8657	0.3117	0.4417	0.1560	0.5324
Qwen-Image-2512	0.8760	0.9900	0.2920	0.3380	0.1510	0.5300
GLM-Image	0.8050	0.9690	0.2980	0.3530	0.2130	0.5280
Z-Image-Turbo	0.8400	0.9940	0.2980	0.3680	0.1390	0.5280
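
To see what the Prompt Enhancer changes on this benchmark, the sketch below subtracts the ERNIE-Image (w/o PE) row from the ERNIE-Image (w/ PE) row, dimension by dimension. The numbers are copied from the table above; the interpretation is only what those two rows show and is not a claim made by the source.

```python
DIMENSIONS = ["Alignment", "Text", "Reasoning", "Style", "Diversity", "Overall"]

# ERNIE-Image rows on OneIG-EN, copied from the table above
with_pe = [0.8678, 0.9788, 0.3566, 0.4309, 0.2411, 0.5750]
without_pe = [0.8909, 0.9668, 0.2950, 0.4471, 0.1687, 0.5537]

# Positive delta means the score is higher with the Prompt Enhancer.
deltas = {
    dim: round(pe - no_pe, 4)
    for dim, pe, no_pe in zip(DIMENSIONS, with_pe, without_pe)
}

for dim, delta in deltas.items():
    print(f"{dim:<10} {delta:+.4f}")
```

On these two rows, enabling the Prompt Enhancer raises Text, Reasoning, Diversity, and Overall, while Alignment and Style are slightly lower.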

OneIG-ZH

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8430	0.9830	0.3110	0.4610	0.2360	0.5670
ERNIE-Image (w/ PE)	0.8299	0.9539	0.3056	0.4342	0.2478	0.5543
Seedream 4.0	0.8360	0.9860	0.3040	0.4430	0.2000	0.5540
Seedream 4.5	0.8320	0.9860	0.3000	0.4260	0.2130	0.5510
Qwen-Image	0.8250	0.9630	0.2670	0.4050	0.2790	0.5480
ERNIE-Image-Turbo (w/ PE)	0.8258	0.9386	0.3043	0.4208	0.2281	0.5435
Z-Image	0.7930	0.9880	0.2660	0.3860	0.2430	0.5350
ERNIE-Image (w/o PE)	0.8421	0.8979	0.2656	0.4212	0.1772	0.5208
Qwen-Image-2512	0.8230	0.9830	0.2720	0.3420	0.1570	0.5150
GLM-Image	0.7380	0.9760	0.2840	0.3350	0.2210	0.5110
Z-Image-Turbo	0.7820	0.9820	0.2760	0.3610	0.1340	0.5070
ERNIE-Image-Turbo (w/o PE)	0.8326	0.9086	0.2580	0.4002	0.1316	0.5062
FLUX.2-klein-9B	0.8201	0.4920	0.2599	0.4166	0.1625	0.4302
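
Comparing the Text column between OneIG-EN and OneIG-ZH gives a quick view of how text rendering holds up in Chinese. The sketch below does this for four models that appear in both tables; the values are copied from those tables, and the choice of models is arbitrary.

```python
# model -> (OneIG-EN Text score, OneIG-ZH Text score), copied from the two tables
TEXT_SCORES = {
    "ERNIE-Image (w/ PE)": (0.9788, 0.9539),
    "ERNIE-Image (w/o PE)": (0.9668, 0.8979),
    "Qwen-Image": (0.8910, 0.9630),
    "FLUX.2-klein-9B": (0.8657, 0.4920),
}

# Negative gap means the Chinese score is lower than the English one.
gaps = {model: round(zh - en, 4) for model, (en, zh) in TEXT_SCORES.items()}
largest_drop = min(gaps, key=gaps.get)

for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:<22} ZH - EN = {gap:+.4f}")
print("Largest drop:", largest_drop)
```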

LongTextBench

Model	LongText-Bench-EN	LongText-Bench-ZH	Avg
Seedream 4.5	0.9890	0.9873	0.9882
ERNIE-Image (w/ PE)	0.9804	0.9661	0.9733
GLM-Image	0.9524	0.9788	0.9656
ERNIE-Image-Turbo (w/ PE)	0.9675	0.9636	0.9655
Nano Banana 2.0	0.9808	0.9491	0.9650
ERNIE-Image-Turbo (w/o PE)	0.9602	0.9675	0.9639
ERNIE-Image (w/o PE)	0.9679	0.9594	0.9636
Qwen-Image-2512	0.9561	0.9647	0.9604
Qwen-Image	0.9430	0.9460	0.9445
Z-Image	0.9350	0.9360	0.9355
Seedream 4.0	0.9214	0.9261	0.9238
Z-Image-Turbo	0.9170	0.9260	0.9215
FLUX.2-klein-9B	0.8642	0.2183	0.5413
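
The table above is easier to query once it is in a data structure. The sketch below ranks the models by the mean of their EN and ZH scores, assuming the Avg column is that unweighted mean (the source does not state this). All scores are copied from the table; `rank_of` is a hypothetical helper.

```python
# (model, LongText-Bench-EN, LongText-Bench-ZH), copied from the table above
ROWS = [
    ("Seedream 4.5", 0.9890, 0.9873),
    ("ERNIE-Image (w/ PE)", 0.9804, 0.9661),
    ("GLM-Image", 0.9524, 0.9788),
    ("ERNIE-Image-Turbo (w/ PE)", 0.9675, 0.9636),
    ("Nano Banana 2.0", 0.9808, 0.9491),
    ("ERNIE-Image-Turbo (w/o PE)", 0.9602, 0.9675),
    ("ERNIE-Image (w/o PE)", 0.9679, 0.9594),
    ("Qwen-Image-2512", 0.9561, 0.9647),
    ("Qwen-Image", 0.9430, 0.9460),
    ("Z-Image", 0.9350, 0.9360),
    ("Seedream 4.0", 0.9214, 0.9261),
    ("Z-Image-Turbo", 0.9170, 0.9260),
    ("FLUX.2-klein-9B", 0.8642, 0.2183),
]

# Sort by the recomputed EN/ZH mean, highest first.
ranked = sorted(ROWS, key=lambda row: (row[1] + row[2]) / 2, reverse=True)


def rank_of(model: str) -> int:
    """1-based position of a model in the ranking."""
    return [name for name, _, _ in ranked].index(model) + 1


print("ERNIE-Image (w/ PE) rank:", rank_of("ERNIE-Image (w/ PE)"), "of", len(ranked))
```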

Quick Start

Recommended Parameters

Resolution:
- 1024x1024
- 848x1264
- 1264x848
- 768x1376
- 896x1200
- 1376x768
- 1200x896
Guidance scale: 4.0
Inference steps: 50
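
When a target size is not one of the recommended resolutions, one option is to snap it to the recommended entry with the closest aspect ratio. The sketch below does this; `nearest_recommended` is a hypothetical helper, and the entries are read as width x height. That reading is an assumption, although the list contains every non-square size in both orientations, so the returned pair is a listed resolution under either reading.

```python
import math

# Recommended resolutions from the list above, read as (width, height).
RECOMMENDED = [
    (1024, 1024),
    (848, 1264),
    (1264, 848),
    (768, 1376),
    (896, 1200),
    (1376, 768),
    (1200, 896),
]


def nearest_recommended(width: int, height: int) -> tuple[int, int]:
    """Return the recommended (width, height) whose aspect ratio is closest.

    Ratios are compared on a log scale so that portrait and landscape
    deviations are treated symmetrically.
    """
    target = math.log(width / height)
    return min(RECOMMENDED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


print(nearest_recommended(1920, 1080))  # a 16:9 landscape request
print(nearest_recommended(3, 4))        # a 3:4 portrait request
```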

Diffusers

pip install git+https://github.com/huggingface/diffusers

import torch
from diffusers import ErnieImagePipeline

pipe = ErnieImagePipeline.from_pretrained(
    "Baidu/ERNIE-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,  # use the Prompt Enhancer
).images[0]

image.save("output.png")

SGLang

Install the latest version of sglang:

git clone https://github.com/sgl-project/sglang.git

Start the server:

sglang serve --model-path baidu/ERNIE-Image

Send a generation request:

curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": true
  }' \
  --output output.png
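
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl command above: the endpoint, header, and JSON field names are copied from it, while `build_request` and `generate` are hypothetical helper names. The source does not describe the response body, so `generate` simply writes the raw bytes to disk, as `curl --output` does; if the server returns JSON instead of image bytes, the body would need to be decoded first.

```python
import json
import urllib.request

# Endpoint copied from the curl command above.
ENDPOINT = "http://localhost:30000/v1/images/generations"


def build_request(prompt, height=1264, width=848, steps=50,
                  guidance_scale=4.0, use_pe=True):
    """Build the POST request with the same JSON fields as the curl example."""
    payload = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, out_path="output.png", timeout=600):
    """Send the request and write the raw response body to out_path."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = resp.read()
    with open(out_path, "wb") as f:
        f.write(body)
    return out_path


# Usage, with the server from the previous step running:
#   generate("This is a photograph depicting an urban street scene.")
```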
