baidu/ERNIE-Image-Turbo

Hugging Face Models Trending 2026/04/02 10:57 Model

text-to-image diffusion-transformer open-source image-generation baidu fast-inference dit

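Summary

Baidu has released ERNIE-Image-Turbo, a distilled text-to-image model that generates images in 8 inference steps while retaining strong text rendering, instruction following, and structured image generation.

Task: text-to-image Tags: diffusers, safetensors, text-to-image, 8B, license:apache-2.0, diffusers:ErnieImagePipeline, region:us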

Cached: 2026/04/20 14:45

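ERNIE-Image-Turbo is the distilled version of ERNIE-Image, an open text-to-image generation model developed by Baidu's ERNIE-Image team. It is built on the same single-stream Diffusion Transformer (DiT) family and needs only 8 inference steps to deliver fast generation with strong fidelity. The model retains strong controllability in practical generation scenarios, aiming for both aesthetic quality and accurate realization of the requested content. In particular, ERNIE-Image-Turbo performs strongly on complex instruction following, text rendering, and structured image generation, which makes it well suited to content-creation tasks such as posters, comics, and multi-panel layouts that demand both visual quality and efficiency. It also supports a wide range of visual styles, including realistic photography, design-oriented imagery, and stylized aesthetic output.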

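Highlights:

Fast and efficient: As a distilled checkpoint of ERNIE-Image, ERNIE-Image-Turbo delivers strong generation quality in only 8 inference steps, which suits latency-sensitive applications.
Text rendering: ERNIE-Image-Turbo handles dense, long-form, and layout-sensitive text well, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
Instruction following: The model reliably follows complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions.
Structured generation: ERNIE-Image-Turbo is effective on structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization matter.
Style coverage: Beyond clean, legible design-oriented output, the model also supports realistic photography and distinctive stylized aesthetics, including softer, more cinematic visual tones.
Practical deployment: Thanks to its compact size, ERNIE-Image-Turbo can run on consumer GPUs with 24 GB of VRAM, lowering the barrier for research, downstream use, and model adaptation.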
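
The 24 GB figure can be sanity-checked with a rough back-of-the-envelope estimate. The sketch below assumes that the "8B" model tag counts all weights and that they are stored in bfloat16 (2 bytes per parameter), and it treats 24 GB as 24 GiB. The helper name `weight_memory_gib` is hypothetical. It ignores activations and any auxiliary components, so it only shows that the weights alone leave headroom, not the actual peak usage.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: "8B" from the model tags counts every weight in the checkpoint.
NUM_PARAMS = 8e9
# bfloat16 stores each parameter in 2 bytes.
BYTES_PER_PARAM = 2
# Assumption: the 24 GB card is treated as 24 GiB for simplicity.
GPU_MEMORY_GIB = 24

weights_gib = weight_memory_gib(NUM_PARAMS, BYTES_PER_PARAM)
headroom_gib = GPU_MEMORY_GIB - weights_gib

print(f"weights:  {weights_gib:.1f} GiB")
print(f"headroom: {headroom_gib:.1f} GiB for activations and everything else")
```

Under these assumptions the weights take roughly 15 GiB, which leaves about 9 GiB for everything else.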

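Released Versions

ERNIE-Image: our SFT model, which typically needs 50 inference steps and offers stronger general capability and instruction fidelity.

ERNIE-Image-Turbo: our Turbo model, optimized with DMD and RL, which needs only 8 inference steps for faster generation and higher aesthetic quality.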
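
To make the difference between the two checkpoints concrete, here is a minimal sketch that records the step counts quoted above and computes how many fewer denoising steps the Turbo model runs. The `Checkpoint` class and `step_ratio` helper are hypothetical names used only for illustration. The ratio counts steps only; actual wall-clock speedup depends on per-step cost and fixed overheads, which the source does not report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of a released checkpoint and its usual step count."""
    repo_id: str
    num_inference_steps: int


# Step counts as stated in the release notes above.
ERNIE_IMAGE = Checkpoint("baidu/ERNIE-Image", 50)
ERNIE_IMAGE_TURBO = Checkpoint("baidu/ERNIE-Image-Turbo", 8)


def step_ratio(base: Checkpoint, fast: Checkpoint) -> float:
    """How many times more denoising steps `base` runs than `fast`."""
    return base.num_inference_steps / fast.num_inference_steps


ratio = step_ratio(ERNIE_IMAGE, ERNIE_IMAGE_TURBO)
print(f"{ERNIE_IMAGE.repo_id} runs {ratio:.2f}x as many steps as {ERNIE_IMAGE_TURBO.repo_id}")
```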

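Benchmarks

In the tables in this section, "w/ PE" and "w/o PE" mean with and without the prompt enhancer (the use_pe option in the usage examples).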

GENEval

Model	Single Object	Two Objects	Counting	Colors	Position	Attribute Binding	Overall
ERNIE-Image (w/o PE)	1.0000	0.9596	0.7781	0.9282	0.8550	0.7925	0.8856
ERNIE-Image (w/ PE)	0.9906	0.9596	0.8187	0.8830	0.8625	0.7225	0.8728
Qwen-Image	0.9900	0.9200	0.8900	0.8800	0.7600	0.7700	0.8683
ERNIE-Image-Turbo (w/o PE)	1.0000	0.9621	0.7906	0.9202	0.7975	0.7300	0.8667
ERNIE-Image-Turbo (w/ PE)	0.9938	0.9419	0.8375	0.8351	0.7950	0.7025	0.8510
FLUX.2-klein-9B	0.9313	0.9571	0.8281	0.9149	0.7175	0.7400	0.8481
Z-Image	1.0000	0.9400	0.7800	0.9300	0.6200	0.7700	0.8400
Z-Image-Turbo	1.0000	0.9500	0.7700	0.8900	0.6500	0.6800	0.8233
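
The source does not say how the Overall column is computed, but the numbers are consistent with an unweighted mean of the six sub-scores. The sketch below checks that reading against the four ERNIE rows copied from the table above; `overall_gap` is a hypothetical helper name, and the mean-based interpretation is an observation, not a documented definition.

```python
from statistics import mean

# (six sub-scores, reported overall), copied from the GENEval table above.
GENEVAL_ROWS = {
    "ERNIE-Image (w/o PE)": ((1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925), 0.8856),
    "ERNIE-Image (w/ PE)": ((0.9906, 0.9596, 0.8187, 0.8830, 0.8625, 0.7225), 0.8728),
    "ERNIE-Image-Turbo (w/o PE)": ((1.0000, 0.9621, 0.7906, 0.9202, 0.7975, 0.7300), 0.8667),
    "ERNIE-Image-Turbo (w/ PE)": ((0.9938, 0.9419, 0.8375, 0.8351, 0.7950, 0.7025), 0.8510),
}


def overall_gap(sub_scores, reported):
    """Absolute difference between the plain mean and the reported overall."""
    return abs(mean(sub_scores) - reported)


gaps = {model: overall_gap(scores, overall) for model, (scores, overall) in GENEVAL_ROWS.items()}

for model, gap in gaps.items():
    # A gap below 1e-4 is within the rounding of the four-decimal table values.
    print(f"{model:<28} gap={gap:.6f}")
```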

OneIG-EN

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8880	0.9440	0.3340	0.4810	0.2450	0.5780
Seedream 4.5	0.8910	0.9980	0.3500	0.4340	0.2070	0.5760
ERNIE-Image (w/ PE)	0.8678	0.9788	0.3566	0.4309	0.2411	0.5750
Seedream 4.0	0.8920	0.9830	0.3470	0.4530	0.1910	0.5730
ERNIE-Image-Turbo (w/ PE)	0.8676	0.9666	0.3537	0.4191	0.2212	0.5656
ERNIE-Image (w/o PE)	0.8909	0.9668	0.2950	0.4471	0.1687	0.5537
Z-Image	0.8810	0.9870	0.2800	0.3870	0.1940	0.5460
Qwen-Image	0.8820	0.8910	0.3060	0.4180	0.1970	0.5390
ERNIE-Image-Turbo (w/o PE)	0.8795	0.9488	0.2913	0.4277	0.1232	0.5341
FLUX.2-klein-9B	0.8871	0.8657	0.3117	0.4417	0.1560	0.5324
Qwen-Image-2512	0.8760	0.9900	0.2920	0.3380	0.1510	0.5300
GLM-Image	0.8050	0.9690	0.2980	0.3530	0.2130	0.5280
Z-Image-Turbo	0.8400	0.9940	0.2980	0.3680	0.1390	0.5280
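
The effect of the prompt enhancer is easier to see as per-dimension differences. The following sketch subtracts the ERNIE-Image-Turbo (w/o PE) row from the (w/ PE) row of the OneIG-EN table above; the `pe_deltas` helper is a hypothetical name, and the comparison covers only this one pair of rows.

```python
DIMENSIONS = ("alignment", "text", "reasoning", "style", "diversity")

# ERNIE-Image-Turbo rows copied from the OneIG-EN table above.
TURBO_WITH_PE = (0.8676, 0.9666, 0.3537, 0.4191, 0.2212)
TURBO_WITHOUT_PE = (0.8795, 0.9488, 0.2913, 0.4277, 0.1232)


def pe_deltas(with_pe, without_pe):
    """Score change per dimension when the prompt enhancer is switched on."""
    return {
        name: round(w - wo, 4)
        for name, w, wo in zip(DIMENSIONS, with_pe, without_pe)
    }


deltas = pe_deltas(TURBO_WITH_PE, TURBO_WITHOUT_PE)

for name, delta in deltas.items():
    print(f"{name:<10} {delta:+.4f}")
```

For this pair of rows, the enhancer raises reasoning and diversity the most, while alignment and style drop slightly.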

OneIG-ZH

Model	Alignment	Text	Reasoning	Style	Diversity	Overall
Nano Banana 2.0	0.8430	0.9830	0.3110	0.4610	0.2360	0.5670
ERNIE-Image (w/ PE)	0.8299	0.9539	0.3056	0.4342	0.2478	0.5543
Seedream 4.0	0.8360	0.9860	0.3040	0.4430	0.2000	0.5540
Seedream 4.5	0.8320	0.9860	0.3000	0.4260	0.2130	0.5510
Qwen-Image	0.8250	0.9630	0.2670	0.4050	0.2790	0.5480
ERNIE-Image-Turbo (w/ PE)	0.8258	0.9386	0.3043	0.4208	0.2281	0.5435
Z-Image	0.7930	0.9880	0.2660	0.3860	0.2430	0.5350
ERNIE-Image (w/o PE)	0.8421	0.8979	0.2656	0.4212	0.1772	0.5208
Qwen-Image-2512	0.8230	0.9830	0.2720	0.3420	0.1570	0.5150
GLM-Image	0.7380	0.9760	0.2840	0.3350	0.2210	0.5110
Z-Image-Turbo	0.7820	0.9820	0.2760	0.3610	0.1340	0.5070
ERNIE-Image-Turbo (w/o PE)	0.8326	0.9086	0.2580	0.4002	0.1316	0.5062
FLUX.2-klein-9B	0.8201	0.4920	0.2599	0.4166	0.1625	0.4302

LongTextBench

Model	LongText-Bench-EN	LongText-Bench-ZH	Average
Seedream 4.5	0.9890	0.9873	0.9882
ERNIE-Image (w/ PE)	0.9804	0.9661	0.9733
GLM-Image	0.9524	0.9788	0.9656
ERNIE-Image-Turbo (w/ PE)	0.9675	0.9636	0.9655
Nano Banana 2.0	0.9808	0.9491	0.9650
ERNIE-Image-Turbo (w/o PE)	0.9602	0.9675	0.9639
ERNIE-Image (w/o PE)	0.9679	0.9594	0.9636
Qwen-Image-2512	0.9561	0.9647	0.9604
Qwen-Image	0.9430	0.9460	0.9445
Z-Image	0.9350	0.9360	0.9355
Seedream 4.0	0.9214	0.9261	0.9238
Z-Image-Turbo	0.9170	0.9260	0.9215
FLUX.2-klein-9B	0.8642	0.2183	0.5413

Quick Start

Diffusers

Install the latest version of diffusers:

pip install git+https://github.com/huggingface/diffusers

```python
import torch
from diffusers import ErnieImagePipeline

# Load the Turbo checkpoint in bfloat16 and move it to the GPU.
pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=8,  # the Turbo checkpoint targets 8 steps
    guidance_scale=1.0,
    use_pe=True,  # use prompt enhancer
).images[0]

image.save("output.png")
```

SGLang

Install the latest version of sglang:

git clone https://github.com/sgl-project/sglang.git

Start the server:

sglang serve --model-path baidu/ERNIE-Image-Turbo

Send a generation request:

curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": true
  }' \
  --output output.png
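
The same request can be issued from Python. This is a minimal sketch using only the standard library; the endpoint URL and JSON field names are copied from the curl example above, while the helper names `build_request` and `save_response` are hypothetical. Like `curl --output`, it writes the raw response body to a file; the source does not state whether that body is image bytes or a JSON document, so inspect it before relying on the file extension.

```python
import json
import urllib.request

# Endpoint copied from the curl example; adjust host and port to your server.
URL = "http://localhost:30000/v1/images/generations"


def build_request(prompt: str, url: str = URL) -> urllib.request.Request:
    """Build the POST request. Nothing is sent until it is opened."""
    payload = {
        "prompt": prompt,
        "height": 1264,
        "width": 848,
        "num_inference_steps": 8,
        "guidance_scale": 1.0,
        "use_pe": True,  # use prompt enhancer
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_response(request: urllib.request.Request,
                  path: str = "output.png",
                  timeout: float = 600.0) -> int:
    """Send the request and write the raw response body to `path`.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    with open(path, "wb") as f:
        f.write(body)
    return len(body)
```

With the server running, `save_response(build_request("a cyclist on a covered street at dusk"))` would mirror the curl call with a shorter prompt.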
