baidu/ERNIE-Image
Summary
Baidu has released ERNIE-Image, an open-weight text-to-image generation model with 8B parameters, built on a Diffusion Transformer (DiT) architecture. It reports state-of-the-art performance among open-weight models, with strong capabilities in text rendering, instruction following, and structured image generation.
Cached at: 04/20/26, 02:45 PM
Source: https://huggingface.co/baidu/ERNIE-Image
ERNIE-Image is an open text-to-image generation model developed by the ERNIE-Image team at Baidu. It is built on a single-stream Diffusion Transformer (DiT) and paired with a lightweight Prompt Enhancer that expands brief user inputs into richer structured descriptions. With only 8B DiT parameters, it reaches state-of-the-art performance among open-weight text-to-image models. The model is designed not only for strong visual quality, but also for controllability in practical generation scenarios where accurate content realization matters as much as aesthetics. In particular, ERNIE-Image performs strongly on complex instruction following, text rendering, and structured image generation, making it well suited for commercial posters, comics, multi-panel layouts, and other content creation tasks that require both visual quality and precise control. It also supports a broad range of visual styles, including realistic photography, design-oriented imagery, and more stylized aesthetic outputs.

Highlights:
- Compact but strong: Despite its compact 8B scale, ERNIE-Image remains highly competitive with substantially larger open-weight models across a range of benchmarks.
- Text rendering: ERNIE-Image performs particularly well on dense, long-form, and layout-sensitive text, making it a strong choice for posters, infographics, UI-like images, and other text-heavy visual content.
- Instruction following: The model reliably follows complex prompts involving multiple objects, detailed relationships, and knowledge-intensive descriptions.
- Structured generation: ERNIE-Image is especially effective for structured visual tasks such as posters, comics, storyboards, and multi-panel compositions, where layout and organization are critical.
- Style coverage: In addition to clean and readable design-oriented outputs, the model also supports realistic photography and distinctive stylized aesthetics, including softer and more cinematic visual tones.
- Practical deployment: Thanks to its compact size, ERNIE-Image can run on consumer GPUs with 24 GB of VRAM, which lowers the barrier for research, downstream use, and model adaptation.
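
As a rough sanity check on the 24 GB figure, the sketch below estimates the memory needed just to store 8B parameters at a few common precisions. It is an illustration under stated assumptions: only the DiT weights are counted, and the text encoder, VAE, Prompt Enhancer, and activations are ignored, so the real footprint will be higher. The function name `weight_memory_gib` is made up for this example.

```python
# Back-of-the-envelope estimate of weight memory for an 8B-parameter model.
# Assumption: only the DiT weights are counted; other pipeline components
# and activation memory are ignored.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gib(8e9, dtype):.1f} GiB")
```

At bfloat16, the weights alone come to roughly 15 GiB, which leaves some headroom on a 24 GB card for the remaining components.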
Released Versions
- ERNIE-Image: Our SFT model. It delivers stronger general-purpose capability and instruction fidelity, typically in 50 inference steps.
- ERNIE-Image-Turbo: Our Turbo model, optimized by DMD and RL. It achieves faster speed and higher aesthetics in only 8 inference steps.
Benchmark
GENEval

| Model | Single Object | Two Object | Counting | Colors | Position | Attribute Binding | Overall |
|---|---|---|---|---|---|---|---|
| ERNIE-Image (w/o PE) | 1.0000 | 0.9596 | 0.7781 | 0.9282 | 0.8550 | 0.7925 | 0.8856 |
| ERNIE-Image (w/ PE) | 0.9906 | 0.9596 | 0.8187 | 0.8830 | 0.8625 | 0.7225 | 0.8728 |
| Qwen-Image | 0.9900 | 0.9200 | 0.8900 | 0.8800 | 0.7600 | 0.7700 | 0.8683 |
| ERNIE-Image-Turbo (w/o PE) | 1.0000 | 0.9621 | 0.7906 | 0.9202 | 0.7975 | 0.7300 | 0.8667 |
| ERNIE-Image-Turbo (w/ PE) | 0.9938 | 0.9419 | 0.8375 | 0.8351 | 0.7950 | 0.7025 | 0.8510 |
| FLUX.2-klein-9B | 0.9313 | 0.9571 | 0.8281 | 0.9149 | 0.7175 | 0.7400 | 0.8481 |
| Z-Image | 1.0000 | 0.9400 | 0.7800 | 0.9300 | 0.6200 | 0.7700 | 0.8400 |
| Z-Image-Turbo | 1.0000 | 0.9500 | 0.7700 | 0.8900 | 0.6500 | 0.6800 | 0.8233 |
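
The Overall column in the GENEval results appears to be the unweighted mean of the six task scores. This is inferred from the numbers, not stated on the model card. The sketch below checks it for three rows copied from the table.

```python
# Task scores copied from the GENEval results, in the order: Single Object,
# Two Object, Counting, Colors, Position, Attribute Binding.
# Each entry pairs the six task scores with the reported Overall value.
GENEVAL = {
    "ERNIE-Image (w/o PE)": ([1.0000, 0.9596, 0.7781, 0.9282, 0.8550, 0.7925], 0.8856),
    "ERNIE-Image (w/ PE)": ([0.9906, 0.9596, 0.8187, 0.8830, 0.8625, 0.7225], 0.8728),
    "Qwen-Image": ([0.9900, 0.9200, 0.8900, 0.8800, 0.7600, 0.7700], 0.8683),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


for model, (scores, reported) in GENEVAL.items():
    print(f"{model}: mean={mean(scores):.4f} reported={reported:.4f}")
```

For these rows, the computed mean matches the reported Overall to four decimal places.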
OneIG-EN

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.8880 | 0.9440 | 0.3340 | 0.4810 | 0.2450 | 0.5780 |
| Seedream 4.5 | 0.8910 | 0.9980 | 0.3500 | 0.4340 | 0.2070 | 0.5760 |
| ERNIE-Image (w/ PE) | 0.8678 | 0.9788 | 0.3566 | 0.4309 | 0.2411 | 0.5750 |
| Seedream 4.0 | 0.8920 | 0.9830 | 0.3470 | 0.4530 | 0.1910 | 0.5730 |
| ERNIE-Image-Turbo (w/ PE) | 0.8676 | 0.9666 | 0.3537 | 0.4191 | 0.2212 | 0.5656 |
| ERNIE-Image (w/o PE) | 0.8909 | 0.9668 | 0.2950 | 0.4471 | 0.1687 | 0.5537 |
| Z-Image | 0.8810 | 0.9870 | 0.2800 | 0.3870 | 0.1940 | 0.5460 |
| Qwen-Image | 0.8820 | 0.8910 | 0.3060 | 0.4180 | 0.1970 | 0.5390 |
| ERNIE-Image-Turbo (w/o PE) | 0.8795 | 0.9488 | 0.2913 | 0.4277 | 0.1232 | 0.5341 |
| FLUX.2-klein-9B | 0.8871 | 0.8657 | 0.3117 | 0.4417 | 0.1560 | 0.5324 |
| Qwen-Image-2512 | 0.8760 | 0.9900 | 0.2920 | 0.3380 | 0.1510 | 0.5300 |
| GLM-Image | 0.8050 | 0.9690 | 0.2980 | 0.3530 | 0.2130 | 0.5280 |
| Z-Image-Turbo | 0.8400 | 0.9940 | 0.2980 | 0.3680 | 0.1390 | 0.5280 |
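
To see what the Prompt Enhancer (PE) changes on OneIG-EN, the sketch below subtracts the ERNIE-Image (w/o PE) row from the ERNIE-Image (w/ PE) row, metric by metric. The values are copied from the results above; the helper name `pe_deltas` is made up for this example.

```python
# ERNIE-Image scores on OneIG-EN, copied from the reported results.
METRICS = ["Alignment", "Text", "Reasoning", "Style", "Diversity", "Overall"]
WITHOUT_PE = [0.8909, 0.9668, 0.2950, 0.4471, 0.1687, 0.5537]
WITH_PE = [0.8678, 0.9788, 0.3566, 0.4309, 0.2411, 0.5750]


def pe_deltas(with_pe, without_pe):
    """Return the per-metric change (w/ PE minus w/o PE), rounded to 4 decimals."""
    return {
        metric: round(w - wo, 4)
        for metric, w, wo in zip(METRICS, with_pe, without_pe)
    }


for metric, delta in pe_deltas(WITH_PE, WITHOUT_PE).items():
    print(f"{metric:<10} {delta:+.4f}")
```

On these numbers, the Prompt Enhancer mainly raises Reasoning and Diversity while slightly lowering Alignment and Style.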
OneIG-ZH

| Model | Alignment | Text | Reasoning | Style | Diversity | Overall |
|---|---|---|---|---|---|---|
| Nano Banana 2.0 | 0.8430 | 0.9830 | 0.3110 | 0.4610 | 0.2360 | 0.5670 |
| ERNIE-Image (w/ PE) | 0.8299 | 0.9539 | 0.3056 | 0.4342 | 0.2478 | 0.5543 |
| Seedream 4.0 | 0.8360 | 0.9860 | 0.3040 | 0.4430 | 0.2000 | 0.5540 |
| Seedream 4.5 | 0.8320 | 0.9860 | 0.3000 | 0.4260 | 0.2130 | 0.5510 |
| Qwen-Image | 0.8250 | 0.9630 | 0.2670 | 0.4050 | 0.2790 | 0.5480 |
| ERNIE-Image-Turbo (w/ PE) | 0.8258 | 0.9386 | 0.3043 | 0.4208 | 0.2281 | 0.5435 |
| Z-Image | 0.7930 | 0.9880 | 0.2660 | 0.3860 | 0.2430 | 0.5350 |
| ERNIE-Image (w/o PE) | 0.8421 | 0.8979 | 0.2656 | 0.4212 | 0.1772 | 0.5208 |
| Qwen-Image-2512 | 0.8230 | 0.9830 | 0.2720 | 0.3420 | 0.1570 | 0.5150 |
| GLM-Image | 0.7380 | 0.9760 | 0.2840 | 0.3350 | 0.2210 | 0.5110 |
| Z-Image-Turbo | 0.7820 | 0.9820 | 0.2760 | 0.3610 | 0.1340 | 0.5070 |
| ERNIE-Image-Turbo (w/o PE) | 0.8326 | 0.9086 | 0.2580 | 0.4002 | 0.1316 | 0.5062 |
| FLUX.2-klein-9B | 0.8201 | 0.4920 | 0.2599 | 0.4166 | 0.1625 | 0.4302 |
LongTextBench

| Model | LongText-Bench-EN | LongText-Bench-ZH | Avg |
|---|---|---|---|
| Seedream 4.5 | 0.9890 | 0.9873 | 0.9882 |
| ERNIE-Image (w/ PE) | 0.9804 | 0.9661 | 0.9733 |
| GLM-Image | 0.9524 | 0.9788 | 0.9656 |
| ERNIE-Image-Turbo (w/ PE) | 0.9675 | 0.9636 | 0.9655 |
| Nano Banana 2.0 | 0.9808 | 0.9491 | 0.9650 |
| ERNIE-Image-Turbo (w/o PE) | 0.9602 | 0.9675 | 0.9639 |
| ERNIE-Image (w/o PE) | 0.9679 | 0.9594 | 0.9636 |
| Qwen-Image-2512 | 0.9561 | 0.9647 | 0.9604 |
| Qwen-Image | 0.9430 | 0.9460 | 0.9445 |
| Z-Image | 0.9350 | 0.9360 | 0.9355 |
| Seedream 4.0 | 0.9214 | 0.9261 | 0.9238 |
| Z-Image-Turbo | 0.9170 | 0.9260 | 0.9215 |
| FLUX.2-klein-9B | 0.8642 | 0.2183 | 0.5413 |
Quick Start
Recommended Parameters
- Resolution:
  - 1024x1024
  - 848x1264
  - 1264x848
  - 768x1376
  - 896x1200
  - 1376x768
  - 1200x896
- Guidance scale: 4.0
- Inference steps: 50
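
When a target aspect ratio is known, a small helper can pick the closest entry from the recommended resolution list. The sketch below is illustrative only: it assumes the listed sizes are written as width x height, and `closest_resolution` is a made-up helper, not part of any library.

```python
import math

# Recommended resolutions from the list above, read as (width, height).
# Assumption: the "AxB" notation means width x height.
RECOMMENDED = [
    (1024, 1024),
    (848, 1264),
    (1264, 848),
    (768, 1376),
    (896, 1200),
    (1376, 768),
    (1200, 896),
]


def closest_resolution(aspect_ratio: float) -> tuple[int, int]:
    """Return the recommended (width, height) closest to width/height = aspect_ratio.

    Ratios are compared in log space so that 2:1 and 1:2 are treated symmetrically.
    """
    target = math.log(aspect_ratio)
    return min(RECOMMENDED, key=lambda wh: abs(math.log(wh[0] / wh[1]) - target))


if __name__ == "__main__":
    width, height = closest_resolution(16 / 9)
    print(f"16:9 -> width={width}, height={height}")
```

All seven recommended sizes contain roughly one megapixel, so switching between them changes the shape of the image rather than the pixel budget.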
Diffusers
pip install git+https://github.com/huggingface/diffusers
import torch
from diffusers import ErnieImagePipeline

# Load the pipeline in bfloat16 and move it to the GPU.
pipe = ErnieImagePipeline.from_pretrained(
    "baidu/ERNIE-Image",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Generate one image with the recommended parameters.
image = pipe(
    prompt="This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    height=1264,
    width=848,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,  # use prompt enhancer
).images[0]
image.save("output.png")
SGLang
Install the latest version of sglang:
git clone https://github.com/sgl-project/sglang.git
Start the server:
sglang serve --model-path baidu/ERNIE-Image
Send a generation request:
curl -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "This is a photograph depicting an urban street scene. Shot at eye level, it shows a covered pedestrian or commercial street. Slightly below the center of the frame, a cyclist rides away from the camera toward the background, appearing as a dark silhouette against backlighting with indistinct details. The ground is paved with regular square tiles, bisected by a prominent tactile paving strip running through the scene, whose raised textures are clearly visible under the light. Light streams in diagonally from the right side of the frame, creating a strong backlight effect with a distinct Tyndall effect—visible light beams illuminating dust or vapor in the air and casting long shadows across the street. Several pedestrians appear on the left side and in the distance, some with their backs to the camera and others walking sideways, all rendered as silhouettes or semi-silhouettes. The overall color palette is warm, dominated by golden yellows and dark browns, evoking the atmosphere of dusk or early morning.",
    "height": 1264,
    "width": 848,
    "num_inference_steps": 50,
    "guidance_scale": 4.0,
    "use_pe": true
  }' \
  --output output.png
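
The same request can be sent from Python using only the standard library. The sketch below mirrors the curl call: the endpoint, port, and JSON fields are taken from the command above, while the function names `build_request` and `generate` are made up for this example. The card does not describe the response format, so, like `curl --output`, the sketch simply writes the raw response body to a file.

```python
import json
import urllib.request


def build_request(prompt, height=1264, width=848, steps=50,
                  guidance_scale=4.0, use_pe=True,
                  base_url="http://localhost:30000"):
    """Build a POST request with the same JSON fields as the curl example."""
    payload = {
        "prompt": prompt,
        "height": height,
        "width": width,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }
    return urllib.request.Request(
        f"{base_url}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, out_path="output.png", **kwargs):
    """Send the request and save the raw response body, as `curl --output` does.

    Assumption: a server started with `sglang serve` is already listening.
    """
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = response.read()
    with open(out_path, "wb") as f:
        f.write(body)
```

With the server running, calling `generate("an urban street scene at dusk")` should write the response to `output.png`.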