ByteDance/Bernini-R

Summary

ByteDance open-sourced Bernini-R, a video diffusion renderer that combines an MLLM-based semantic planner with a DiT-based renderer for unified video generation and editing, achieving top-tier performance on video editing.

Task: image-text-to-video
Tags: safetensors, bernini_renderer, image-text-to-video, arxiv:2605.22344, license:apache-2.0, region:us

Cached at: 06/03/26, 09:39 AM

Source: https://huggingface.co/ByteDance/Bernini-R

Bernini: Latent Semantic Planning for Video Diffusion

Chenchen Liu*, Junyi Chen*, Lei Li*, Lu Chi*,§, Mingzhen Sun*, Zhuoying Li*, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan✉

* Equal contribution   ✉ Corresponding author   § Project lead

arXiv · Project Page · HuggingFace

🎉 News

✨ Highlights

Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.

On video editing, Bernini ranks in the first tier alongside leading closed-source commercial models. The leaderboard below comes from our in-house arena platform, where human annotators vote blind on paired edits and the votes are aggregated into a Bradley-Terry score and a pairwise win-rate matrix.

Video editing arena: Bradley-Terry leaderboard and pairwise win-rate matrix
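
To make the aggregation concrete, here is a minimal sketch of how pairwise votes can be turned into Bradley-Terry strengths and a win-rate matrix. It uses the standard MM (minorize-maximize) update. The arena's actual fitting procedure, score scaling, and tie handling are not described in this README, so the function names and the vote counts below are illustrative assumptions only.

```python
from typing import List, Optional


def bradley_terry(wins: List[List[float]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] is the number of times model i beat model j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i and (wins[i][j] + wins[j][i]) > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = sum(new_p)
        p = [x / scale for x in new_p]
    return p


def win_rate_matrix(wins: List[List[float]]) -> List[List[Optional[float]]]:
    """rates[i][j] is the fraction of i-vs-j comparisons won by i (None if no data)."""
    n = len(wins)
    rates: List[List[Optional[float]]] = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            games = wins[i][j] + wins[j][i]
            if i != j and games > 0:
                rates[i][j] = wins[i][j] / games
    return rates


# Made-up vote counts for three hypothetical models A, B, C (not real arena data).
votes = [
    [0, 7, 9],  # A beat B 7 times and C 9 times
    [3, 0, 6],  # B beat A 3 times and C 6 times
    [1, 4, 0],  # C beat A once and B 4 times
]
scores = bradley_terry(votes)
rates = win_rate_matrix(votes)
```

A higher strength means a model is more likely to win a blind pairwise comparison; under the Bradley-Terry model the probability that i beats j is `scores[i] / (scores[i] + scores[j])`.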

📦 Installation

Requirements

  • Python 3.11.2.
  • CUDA GPU: a Hopper GPU (H100/H800/H200) is recommended so FlashAttention-3 can be used; other CUDA GPUs fall back to FlashAttention-2 or PyTorch SDPA.
  • CUDA toolkit 12.4 (matches the pinned torch==2.5.1+cu124; 12.3+ is the minimum if you build FlashAttention-3).
  • Pinned in requirements.txt: torch==2.5.1+cu124, diffusers==0.35.2, accelerate==0.34.2, transformers==4.57.3.

Reference environment (Bernini-R is developed and tested on this setup):

Component   Version
GPU         NVIDIA H100
CUDA        12.4
Python      3.11.2
PyTorch     2.5.1+cu124
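
If you want to confirm that an existing environment matches the pins listed above, a small check along these lines may help. This is a sketch rather than part of the Bernini repository: `PINNED` simply restates the versions from this README, and `check_pins` is a hypothetical helper name.

```python
from importlib import metadata

# Versions restated from the requirements above.
PINNED = {
    "torch": "2.5.1+cu124",
    "diffusers": "0.35.2",
    "accelerate": "0.34.2",
    "transformers": "4.57.3",
}


def check_pins(pinned, get_version=metadata.version):
    """Return {package: (expected, found)} for every package that does not match its pin.

    `found` is None when the package is not installed.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means all four pinned packages match; anything printed points at a package to reinstall from requirements.txt.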

Install

git clone https://github.com/bytedance/Bernini.git bernini && cd bernini
pip install -r requirements.txt

Optional extras:

  • Multi-GPU sequence parallel needs Open-VeOmni (Apache-2.0, Python 3.11). Use --no-deps so VeOmni does not pull in a different torch build and override the pinned torch==2.5.1+cu124:

    pip install --no-deps git+https://github.com/ByteDance-Seed/VeOmni.git@v0.1.10

    Single-GPU inference does not need it.
  • Faster attention (auto-detected if installed; otherwise PyTorch SDPA is used):
      - FlashAttention-2: general CUDA GPUs (incl. A100/A800):

        pip install flash-attn==2.8.3

      - FlashAttention-3: Hopper only (H100/H800/H200, CUDA ≥ 12.3, PyTorch ≥ 2.4). flash_attn_interface is not on PyPI; build it from the flash-attention repo's hopper/ directory at tag v2.8.3:

        git clone https://github.com/Dao-AILab/flash-attention.git
        cd flash-attention && git checkout v2.8.3
        cd hopper && MAX_JOBS=$(nproc) python3 setup.py install --user
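
To see which attention backend an environment is likely to end up with, the fallback order described above can be sketched as follows. This is an assumption about the general shape of the auto-detection, not the repository's actual code: it only checks whether the modules are importable (`flash_attn_interface` for FlashAttention-3, `flash_attn` for FlashAttention-2) and does not verify the GPU architecture. `pick_attention_backend` is a hypothetical name.

```python
from importlib.util import find_spec


def _installed(module_name: str) -> bool:
    """True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


def pick_attention_backend(is_installed=_installed) -> str:
    """Prefer FlashAttention-3, then FlashAttention-2, then PyTorch SDPA."""
    if is_installed("flash_attn_interface"):
        return "flash-attention-3"
    if is_installed("flash_attn"):
        return "flash-attention-2"
    return "sdpa"


if __name__ == "__main__":
    print("attention backend:", pick_attention_backend())
```

Running it before a long job is a cheap way to notice that a FlashAttention build did not install where you expected.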

Weights

Bernini-R provides two ways to obtain the renderer weights. The diffusers format is recommended: it is a self-contained diffusers-format directory whose transformer / transformer_2 already hold the Bernini-R weights, so you point --config at it and the weights load directly, with no --high_noise_ckpt / --low_noise_ckpt needed.

Option A: diffusers format (recommended)

A single ready-to-use diffusers-format model from ByteDance/Bernini-R-Diffusers. It bundles the Wan2.2 base components (VAE, UMT5 text encoder, tokenizer) together with the Bernini-R transformer weights, so nothing else is downloaded at runtime.

pip install -U "huggingface_hub"
hf download ByteDance/Bernini-R-Diffusers --local-dir Bernini-R-Diffusers

Then pass it via --config and omit the checkpoint flags, e.g.:

python infer_single_gpu.py --config Bernini-R-Diffusers \
    --case assets/testcases/t2i/t2i.json --num_frames 1
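
Before pointing --config at the downloaded directory, it can be worth checking that the download completed. The sketch below only looks for the two transformer entries named above, and it assumes they are subdirectories of the model directory, as is usual for diffusers-format models. `missing_subfolders` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path


def missing_subfolders(model_dir, required=("transformer", "transformer_2")):
    """Return the required subfolder names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_subfolders("Bernini-R-Diffusers")
    if missing:
        print("incomplete download, missing:", ", ".join(missing))
    else:
        print("transformer folders found")
```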

Option B: separate checkpoints

The original layout, where Bernini-R uses two sets of weights loaded separately:

  1. Wan2.2 base: Wan-AI/Wan2.2-T2V-A14B-Diffusers on Hugging Face. Supplies the VAE, UMT5 text encoder, tokenizer, and the transformer architecture/base weights. It is downloaded automatically on first run (configured by wan22_base in configs/bernini_renderer_wan22/config.json).
  2. Bernini-R checkpoint: the trained high-noise / low-noise transformer weights (safetensors) from ByteDance/Bernini-R, passed with --high_noise_ckpt / --low_noise_ckpt. Both a local directory and a Hugging Face repo id are accepted.

Download both models with the hf CLI, which ships with huggingface_hub:

pip install -U "huggingface_hub"
hf download Wan-AI/Wan2.2-T2V-A14B-Diffusers --local-dir Wan2.2-T2V-A14B-Diffusers
hf download ByteDance/Bernini-R --local-dir Bernini-R

🚀 Usage

A run is described by a case file: a small JSON under assets/testcases/ that bundles one task's routing and inputs (task_type, guidance_mode, prompt, source media, output). This keeps long prompts out of the command line. Each task has a directory under assets/testcases/ holding one or more case files; see assets/testcases/ for the format and the bundled t2i / i2i / t2v / v2v / rv2v / r2v examples.
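
A convenient way to write your own case is to start from a bundled one and change only the fields you need. The sketch below does that. It assumes a case file is a single JSON object whose top-level keys include the fields listed above; check the files under assets/testcases/ for the real schema. `derive_case` and the example paths are hypothetical.

```python
import json
from pathlib import Path


def derive_case(template_path, new_path, **overrides):
    """Copy a bundled case file, replace selected top-level fields, and save it.

    Assumes the case file is one JSON object (an assumption, see the note above).
    """
    case = json.loads(Path(template_path).read_text(encoding="utf-8"))
    if not isinstance(case, dict):
        raise ValueError("expected the case file to contain a single JSON object")
    case.update(overrides)
    target = Path(new_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(case, ensure_ascii=False, indent=2), encoding="utf-8")
    return case


# Example (paths and prompt are placeholders):
# derive_case(
#     "assets/testcases/t2v/t2v.json",
#     "my_cases/t2v_custom.json",
#     prompt="A paper boat drifting down a rain-soaked street",
#     output="outputs/t2v_custom.mp4",
# )
```

The new file can then be passed with --case exactly like the bundled examples, which keeps task_type and guidance_mode consistent with a known-good case.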

Prompt enhancer (highly recommended)

--use_pe enhances the prompt through an OpenAI-compatible endpoint and is recommended for best generation quality. The openai SDK is installed by requirements.txt; configure the endpoint with environment variables:

export BERNINI_PE_API_KEY=...      # or OPENAI_API_KEY
export BERNINI_PE_BASE_URL=...     # or OPENAI_BASE_URL
export BERNINI_PE_MODEL=...        # vision-capable chat model
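
The variable fallbacks noted in the comments above can be sketched like this. The precedence shown (BERNINI_PE_* first, then the OPENAI_* name) is an assumption drawn from those comments, and `resolve_pe_settings` is a hypothetical helper for checking your shell setup, not the repository's loader.

```python
import os


def resolve_pe_settings(env=None):
    """Resolve prompt-enhancer settings, preferring BERNINI_PE_* over OPENAI_* names."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("BERNINI_PE_API_KEY") or env.get("OPENAI_API_KEY"),
        "base_url": env.get("BERNINI_PE_BASE_URL") or env.get("OPENAI_BASE_URL"),
        "model": env.get("BERNINI_PE_MODEL"),
    }


def missing_pe_settings(settings):
    """Names of settings that are still unset or empty."""
    return [name for name, value in settings.items() if not value]


if __name__ == "__main__":
    unset = missing_pe_settings(resolve_pe_settings())
    print("unset prompt-enhancer settings:", unset or "none")
```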

Examples by task type

Unless an example specifies otherwise, inference outputs 480p / 16fps (the defaults: --max_image_size 848, --fps 16).

Each example runs a bundled case in assets/testcases/; replace <hi>/<lo> with your high-/low-noise checkpoint paths. The image tasks (t2i, i2i) are shown on a single GPU; the video tasks on 8 GPUs via torchrun, where --ulysses N gives N-way Ulysses sequence parallel per sample and the remaining factor of world_size / N is used for data parallelism over the task list. The two scripts take the same inputs, so any example can be run either way.
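
To see how a given --ulysses value divides the GPUs, the arithmetic can be sketched as below. The mapping of consecutive ranks into one sequence-parallel group is an assumption made for illustration; the repository may group ranks differently. `parallel_layout` is a hypothetical name.

```python
def parallel_layout(world_size: int, ulysses: int):
    """Describe each rank's role for N-way sequence parallel plus data parallel.

    Assumes consecutive ranks share a sequence-parallel group (illustrative only).
    """
    if ulysses < 1 or world_size % ulysses != 0:
        raise ValueError("world_size must be a positive multiple of ulysses")
    return [
        {
            "rank": rank,
            "dp_group": rank // ulysses,  # which sample stream this rank works on
            "sp_rank": rank % ulysses,    # position inside the sequence-parallel group
        }
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    for world_size, ulysses in [(8, 8), (8, 2)]:
        groups = world_size // ulysses
        print(f"world_size={world_size}, ulysses={ulysses}: {groups} data-parallel group(s)")
```

With 8 GPUs, --ulysses 8 puts every rank on the same sample, while --ulysses 2 would give four groups that each process a different case from the task list.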

Inputs can also be passed directly as flags instead of --case (--prompt, --task_type, --guidance_mode, --video, --image, --images, --output); generation parameters (--seed, --num_frames, ...) are always command-line flags.

Text-to-image (t2i): single GPU; generates one frame, so pass --num_frames 1

python infer_single_gpu.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> \
    --case assets/testcases/t2i/t2i.json --num_frames 1

Image editing (i2i): single GPU; generates one frame, so pass --num_frames 1

python infer_single_gpu.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> \
    --case assets/testcases/i2i/i2i.json --num_frames 1

Text-to-video (t2v)

torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/t2v/t2v.json

Video editing (v2v/mv2v): two cases are provided.

For edits where the main subject keeps its ordinary motion (case 1 adds a snowman to the scene), the v2v task type is enough:

torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/v2v/v2v_case1.json

For edits that need to change the subject's motion (case 2 makes the person crouch down), the mv2v task type gives better results:

torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/v2v/v2v_case2.json

Reference + video editing (rv2v): two cases are provided.

Case 1 is reference-image-guided video editing — replacing a garment in the source video with one from a reference image:

torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/rv2v/rv2v_case1.json

Case 2 is a video-insertion example — inserting content into the source video. It is run at 720p / 24fps to show the insertion result more clearly:

torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/rv2v/rv2v_case2.json \
    --num_frames 121 --fps 24 --max_image_size 1280

Reference-to-video (r2v): drives a video from one or more reference images

torchrun --nproc-per-node 8 infer_multi_gpu.py \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --ulysses 8 \
    --case assets/testcases/r2v/r2v.json

See python infer_single_gpu.py --help for the full argument list.

Gradio demo

gradio_demo.py exposes the same pipeline through a Gradio UI: the task-type dropdown auto-fills guidance_mode (still user-editable), uploaded media is routed to the matching slot, and the result is rendered inline.

# Single GPU
python gradio_demo.py --high_noise_ckpt <hi> --low_noise_ckpt <lo> --port 7860

# 8 GPUs, 8-way Ulysses sequence parallel
torchrun --nproc-per-node 8 gradio_demo.py --ulysses 8 \
    --high_noise_ckpt <hi> --low_noise_ckpt <lo> --port 7860 --share

Add --use_pe (and export OPENAI_API_KEY=... / BERNINI_PE_API_KEY=...) to enable GPT prompt enhancement; the in-UI checkbox is a per-request switch on top of this flag.

📑 Citation

If you use Bernini in your research, please cite:

@article{bernini,
  title   = {Bernini: Latent Semantic Planning for Video Diffusion},
  author  = {Chenchen Liu and Junyi Chen and Lei Li and Lu Chi and Mingzhen Sun and Zhuoying Li and Yi Fu and Ruoyu Guo and Yiheng Wu and Ge Bai and Zehuan Yuan},
  journal = {arXiv preprint arXiv:2605.22344},
  year    = {2026}
}

🙏 Acknowledgements

Bernini builds on several outstanding open-source projects:

We thank the authors and communities of these projects for their contributions.

📄 License

Apache License 2.0. See LICENSE.