baidu/ERNIE-Image-Turbo

Hugging Face Models Trending: Model

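Summary

Baidu has released ERNIE-Image-Turbo, a distilled text-to-image generation model that produces images in 8 inference steps while retaining strong text rendering, instruction following, and structured image generation.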

Task: text-to-image Tags: diffusers, safetensors, text-to-image, 8B, license:apache-2.0, diffusers:ErnieImagePipeline, region:us

Cached: 2026/04/20 14:45

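ERNIE-Image-Turbo is the distilled version of ERNIE-Image, an open text-to-image generation model developed by the ERNIE-Image team at Baidu. Built on the same single-stream Diffusion Transformer (DiT) family, it is optimized for fast generation with strong fidelity in only 8 inference steps. The model retains strong controllability in practical generation scenarios, where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image-Turbo performs strongly on complex instruction following, text rendering, and structured image generation, which makes it well suited to posters, comics, multi-panel layouts, and other content creation tasks that demand both visual quality and efficiency. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and stylized aesthetic outputs.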

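Highlights:

  • Fast and efficient: as the distilled checkpoint of ERNIE-Image, ERNIE-Image-Turbo delivers strong generation quality in only 8 inference steps, which suits latency-sensitive applications.
  • Text rendering: ERNIE-Image-Turbo handles dense, long-form, and layout-sensitive text well, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
  • Instruction following: the model reliably follows complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions.
  • Structured generation: ERNIE-Image-Turbo is effective on structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
  • Style coverage: beyond clean, legible design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
  • Practical deployment: thanks to its compact size, ERNIE-Image-Turbo runs on consumer GPUs with 24 GB of VRAM, which lowers the barrier to research, downstream use, and model adaptation.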

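Released versions

ERNIE-Image: our SFT model. It typically needs 50 inference steps and offers stronger general-purpose capability and instruction fidelity.

ERNIE-Image-Turbo: our Turbo model, optimized with DMD and RL. It needs only 8 inference steps, giving faster generation and stronger aesthetics.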

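Benchmarks

GENEval

| Model | Single Object | Two Objects | Counting | Colors | Position | Attribute Binding | Overall |
|---|---|---|---|---|---|---|---|
| ERNIE-Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE-Image-Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |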
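
The last column of the GENEval table appears to be the unweighted mean of the six per-category scores. The source does not state this; it is an inference from the numbers. The sketch below checks it for three rows copied from the table. The helper names `mean_score` and `matches_reported` are illustrative, and the tolerance allows for the table values being rounded to four decimals.

```python
# Rows copied from the GENEval table: six per-category scores, then the reported overall score.
GENEVAL_ROWS = {
    "ERNIE-Image (w/o PE)": ([1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925], 0.8856),
    "ERNIE-Image-Turbo (w/o PE)": ([1.0000, 0.9621, 0.7906, 0.9202, 0.7975, 0.7300], 0.8667),
    "ERNIE-Image-Turbo (w/ PE)": ([0.9938, 0.9419, 0.8375, 0.8351, 0.7950, 0.7025], 0.8510),
}


def mean_score(scores):
    """Unweighted arithmetic mean of the per-category scores."""
    return sum(scores) / len(scores)


def matches_reported(scores, reported, tol=1e-4):
    """True if the mean agrees with the reported overall score up to rounding.

    Assumption: the overall score is a plain average (not stated in the source).
    """
    return abs(mean_score(scores) - reported) < tol


for name, (scores, reported) in GENEVAL_ROWS.items():
    status = "consistent" if matches_reported(scores, reported) else "differs"
    print(f"{name}: mean={mean_score(scores):.4f} reported={reported:.4f} ({status})")
```

The same check can be applied to the other tables by changing the number of sub-scores per row.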

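OneIG-EN

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | 0.4810 | 0.2450 | 0.5780 |
| Seedream 4.5 | 0.8910 | 0.9980 | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
| ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| Seedream 4.0 | 0.8920 | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
| ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
| Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
| ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
| FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
| Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
| GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
| Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |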

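OneIG-ZH

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.8430 | 0.9830 | 0.3110 | 0.4610 | 0.2360 | 0.5670 |
| ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
| Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
| Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | 0.2790 | 0.5480 |
| ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| Z-Image | 0.7930 | 0.9880 | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
| ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
| Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
| GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
| Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
| ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
| FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |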

LongTextBench

| Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
|---|---|---|---|
| Seedream 4.5 | 0.9890 | 0.9873 | 0.9882 |
| ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| GLM-Image | 0.9524 | 0.9788 | 0.9656 |
| ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
| ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
| Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
| Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
| Z-Image | 0.9350 | 0.9360 | 0.9355 |
| Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
| Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
| FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |

Quick Start

Recommended parameters

  • Resolution: 1024x1024, 848x1264, 1264x848, 768x1376, 896x1200, 1376x768, 1200x896
  • Guidance scale: 1.0
  • Inference steps: 8
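
As a minimal sketch, the recommended settings above can be collected into a small helper that rejects resolutions outside the listed set. The helper name `turbo_kwargs` and the constants are hypothetical and not part of any library. The keyword names mirror those used in this page's Diffusers example, and reading each resolution as width x height is an assumption (every listed size also appears transposed, so the membership check does not depend on it).

```python
# Recommended values copied from the list above.
RECOMMENDED_RESOLUTIONS = {
    (1024, 1024), (848, 1264), (1264, 848), (768, 1376),
    (896, 1200), (1376, 768), (1200, 896),
}
RECOMMENDED_GUIDANCE_SCALE = 1.0
RECOMMENDED_STEPS = 8


def turbo_kwargs(width, height):
    """Return generation keyword arguments for a recommended resolution.

    Hypothetical helper: raises ValueError for sizes not in the recommended list.
    """
    if (width, height) not in RECOMMENDED_RESOLUTIONS:
        allowed = ", ".join(f"{w}x{h}" for w, h in sorted(RECOMMENDED_RESOLUTIONS))
        raise ValueError(f"{width}x{height} is not a recommended resolution; use one of: {allowed}")
    return {
        "width": width,
        "height": height,
        "num_inference_steps": RECOMMENDED_STEPS,
        "guidance_scale": RECOMMENDED_GUIDANCE_SCALE,
    }


print(turbo_kwargs(848, 1264))
```

The returned dictionary can be unpacked into a pipeline call, for example `pipe(prompt=..., **turbo_kwargs(848, 1264))`.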

Diffusers

Install the latest version of diffusers:

pip install git+https://github.com/huggingface/diffusers

```python
import torch
from diffusers import ErnieImagePipeline

# Load the pipeline in bfloat16 and move it to the GPU.
pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate with the recommended settings: 8 steps, guidance scale 1.0.
image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=True,  # use prompt enhancer
).images[0]

image.save("output.png")
```

SGLang

Install the latest version of sglang:

git clone https://github.com/sgl-project/sglang.git

Start the server:

sglang serve --model-path baidu/ERNIE-Image-Turbo

Send a generation request:

curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 8,
    "guidance_scale": 1.0,
    "use_pe": true
  }' \
  --output output.png
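
The same request can be sketched in Python with only the standard library. The URL, JSON fields, and output filename are taken from the curl command above; the function names `build_request` and `generate_to_file` are hypothetical. Like `curl --output`, the sketch writes the raw response body to a file. The source does not say whether that body is image bytes or JSON, so the sketch makes no attempt to decode it.

```python
import json
import urllib.request

# Endpoint from the curl example; assumes the server runs locally on port 30000.
SERVER_URL = "http://localhost:30000/v1/images/generations"


def build_request(prompt, width=848, height=1264, use_pe=True):
    """Build the POST request with the same JSON fields as the curl example."""
    payload = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": 8,
        "guidance_scale": 1.0,
        "use_pe": use_pe,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_to_file(prompt, path="output.png", timeout=600):
    """Send the request and save the raw response body, as `curl --output` does."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as response:
        body = response.read()
    with open(path, "wb") as handle:
        handle.write(body)
    return path


# Usage (requires a running server):
# generate_to_file("A poster with the headline 'Grand Opening' in bold type")
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.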
