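@XAMTO_AI: ControlNet author Lvmin Zhang (敏神) has shipped something new again! The newly open-sourced FramePack sharply lowers the barrier to video generation: it runs on 6GB of VRAM, a 13B model generates a 1-minute video at 30 fps, and an RTX 4090 needs only 1.5 seconds per frame, requirements that would have been unthinkable before. The core idea is frame-by-frame…
Summary
Lvmin Zhang (敏神), the author of ControlNet, has open-sourced the FramePack video generation model. It runs a 13B model on as little as 6GB of VRAM, generates a 1-minute video at 30 fps, takes about 1.5 seconds per frame on an RTX 4090 with teacache enabled, and ships with a Windows one-click package.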
Cached: 2026/06/09 10:45
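ControlNet author Lvmin Zhang (敏神) has shipped something new again!
The newly open-sourced FramePack sharply lowers the barrier to video generation: it runs on 6GB of VRAM, a 13B model generates a 1-minute video at 30 fps, and an RTX 4090 needs only 1.5 seconds per frame. Requirements like these would have been unthinkable before.
The core idea is frame-by-frame prediction: feed in a single image and get a coherent 1-minute video, with character motion and scene changes holding up well.
A Windows one-click package is already available, so there is no environment setup to wrestle with.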
https://github.com/lllyasviel/FramePack…
lllyasviel/FramePack
Source: https://github.com/lllyasviel/FramePack
FramePack
Official implementation and desktop software for “Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models”.
Links: Paper, Project Page
FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively.
FramePack compresses input contexts to a constant length so that the generation workload is invariant to video length.
FramePack can process a very large number of frames with 13B models even on laptop GPUs.
FramePack can be trained with a much larger batch size, similar to the batch size for image diffusion training.
Video diffusion, but feels like image diffusion.
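To make the constant-length claim concrete, here is a minimal sketch, not the repository's code. It assumes a hypothetical geometric schedule in which the k-th most recent frame keeps `tokens_per_frame // 2**k` tokens. The schedule, the function name `packed_context_length`, and the value 1536 are illustrative assumptions. Under that assumption the total context stays below twice the per-frame budget, however long the history grows.

```python
def packed_context_length(num_frames: int, tokens_per_frame: int = 1536) -> int:
    """Total context tokens when the k-th most recent frame keeps
    tokens_per_frame // 2**k tokens (hypothetical schedule, for illustration)."""
    total = 0
    for k in range(num_frames):
        tokens = tokens_per_frame >> k  # halve the budget for each step back in time
        if tokens == 0:
            break  # frames this old contribute nothing in this toy schedule
        total += tokens
    return total


# The total stops growing once the oldest frames are compressed to nothing.
for n in (1, 10, 100, 1800):
    print(n, packed_context_length(n))
```

In this toy schedule a history of 100 frames and a history of 1800 frames produce the same context length, which is the sense in which the workload no longer depends on video length.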
News
2025 July 14: Some pure text2video anti-drifting stress-test results of FramePack-P1 are uploaded here, using common prompts without any reference images.
2025 June 26: Some results of FramePack-P1 are uploaded here. The FramePack-P1 will be the next version of FramePack with two designs: Planned Anti-Drifting and History Discretization.
2025 May 03: The FramePack-F1 is released. Try it here.
Note that this GitHub repository is the only official FramePack website. We do not have any web services. All other websites are spam and fake, including but not limited to framepack.co, frame_pack.co, framepack.net, frame_pack.net, framepack.ai, frame_pack.ai, framepack.pro, frame_pack.pro, framepack.cc, frame_pack.cc, framepackai.co, frame_pack_ai.co, framepackai.net, frame_pack_ai.net, framepackai.pro, frame_pack_ai.pro, framepackai.cc, frame_pack_ai.cc, and so on. Again, they are all spam and fake. Do not pay money or download files from any of those websites.
Requirements
Note that this repo is functional desktop software with a minimal, standalone, high-quality sampling system and memory management.
Start with this repo before you try anything else!
Requirements:
- An Nvidia GPU in the RTX 30XX, 40XX, or 50XX series that supports fp16 and bf16. The GTX 10XX/20XX series are not tested.
- Linux or Windows operating system.
- At least 6GB GPU memory.
To generate a 1-minute video (60 seconds) at 30 fps (1800 frames) using the 13B model, the minimum required GPU memory is 6GB. (Yes, 6GB, not a typo. Laptop GPUs are okay.)
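A quick way to compare a machine against these requirements is to ask PyTorch what it sees. The following is a minimal sketch, not part of the repository. The helper names are hypothetical, and the 5.5 GiB threshold is an assumption chosen because a nominal 6GB card can report slightly less than 6 GiB of total memory. It checks only bf16 support and memory, not the GPU series.

```python
# Assumed margin below the nominal 6GB, since reported capacity can be slightly lower.
MIN_VRAM_BYTES = int(5.5 * 1024**3)


def meets_requirements(total_memory_bytes: int, supports_bf16: bool) -> bool:
    """Return True if a GPU satisfies the memory and bf16 requirements."""
    return total_memory_bytes >= MIN_VRAM_BYTES and supports_bf16


def check_local_gpu() -> str:
    """Inspect GPU 0 via PyTorch, if PyTorch and CUDA are available."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed; cannot inspect the GPU."
    if not torch.cuda.is_available():
        return "No CUDA device is visible to PyTorch."
    props = torch.cuda.get_device_properties(0)
    ok = meets_requirements(props.total_memory, torch.cuda.is_bf16_supported())
    return f"{props.name}: {'OK' if ok else 'below the stated requirements'}"


if __name__ == "__main__":
    print(check_local_gpu())
```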
About speed, on my RTX 4090 desktop it generates at a speed of 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache). On my laptops like 3070ti laptop or 3060 laptop, it is about 4x to 8x slower. Troubleshoot if your speed is much slower than this.
In any case, you will directly see the generated frames since it is next-frame(-section) prediction. So you will get lots of visual feedback before the entire video is generated.
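The frame count and per-frame speeds quoted above imply rough wall-clock times. The sketch below only multiplies those figures out. The function name is hypothetical, and real runs also include model loading and warmup, which it ignores.

```python
def estimate_render_time(seconds: float, fps: int, sec_per_frame: float) -> tuple[int, float]:
    """Return (frame count, estimated minutes) for a video of the given length."""
    frames = round(seconds * fps)
    return frames, frames * sec_per_frame / 60


# Per-frame speeds are the RTX 4090 figures quoted in the text.
for label, spf in [("RTX 4090, unoptimized", 2.5), ("RTX 4090, teacache", 1.5)]:
    frames, minutes = estimate_render_time(60, 30, spf)
    print(f"{label}: {frames} frames, about {minutes:.0f} minutes")
```

A 1-minute video therefore takes on the order of 45 minutes with teacache or 75 minutes without it on an RTX 4090. On a laptop that is 4x to 8x slower, the same arithmetic gives roughly 3 to 6 hours with teacache.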
Installation
Windows:
>>> Click Here to Download One-Click Package (CUDA 12.6 + Pytorch 2.6) <<<
After downloading, uncompress the package, run update.bat to update, and then run run.bat to start.
Note that running update.bat is important, otherwise you may be using a previous version with potential bugs unfixed.
Note that the models will be downloaded automatically. You will download more than 30GB from HuggingFace.
Linux:
We recommend having an independent Python 3.10.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
To start the GUI, run:
python demo_gradio.py
Note that it supports --share, --port, --server, and so on.
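For readers unfamiliar with such flags, here is a hypothetical sketch of how a launcher script might parse them with `argparse`. It is not copied from `demo_gradio.py`. The value types, defaults, and help strings are assumptions, so check the script itself for the authoritative behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical mirror of the flags named above; types and defaults are assumptions.
    parser = argparse.ArgumentParser(description="Launch a Gradio demo (sketch)")
    parser.add_argument("--share", action="store_true", help="create a public share link")
    parser.add_argument("--server", type=str, default="127.0.0.1", help="address to bind")
    parser.add_argument("--port", type=int, default=None, help="port to listen on")
    return parser


# Equivalent to passing "--share --port 7860" on the command line.
args = build_parser().parse_args(["--share", "--port", "7860"])
print(args.share, args.server, args.port)
```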
The software supports PyTorch attention, xformers, flash-attn, sage-attention. By default, it will just use PyTorch attention. You can install those attention kernels if you know how.
For example, to install sage-attention (linux):
pip install sageattention==1.0.6
However, we highly recommend first trying without sage-attention, since it influences results, though the influence is minimal.
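To see which of the optional kernels listed above are present in an environment, something like the following sketch can help. It is not the repository's selection logic. The function names are hypothetical, and the import names `sageattention`, `flash_attn`, and `xformers` are the module names those packages are commonly imported under.

```python
import importlib.util
from typing import Callable

# Import names of the optional attention kernels (assumed, see lead-in).
OPTIONAL_BACKENDS = ["xformers", "flash_attn", "sageattention"]


def module_installed(name: str) -> bool:
    """True if the module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def available_attention_backends(
    is_installed: Callable[[str], bool] = module_installed,
) -> list[str]:
    """PyTorch attention is always listed first; optional kernels follow if found."""
    return ["pytorch"] + [name for name in OPTIONAL_BACKENDS if is_installed(name)]


print(available_attention_backends())
```

Listing rather than auto-selecting keeps PyTorch attention as the baseline, in line with the advice to try without extra kernels first.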
GUI
On the left you upload an image and write a prompt.
On the right are the generated videos and latent previews.
Because this is a next-frame-section prediction model, the video grows longer as each section is generated.
You will see the progress bar for each section and the latent preview for the next section.
Note that the initial progress may be slower than later diffusion as the device may need some warmup.
Sanity Check
Before trying your own inputs, we highly recommend going through the sanity check to find out if any hardware or software went wrong.
Next-frame-section prediction models are very sensitive to subtle differences in noise and hardware. Usually, people will get slightly different results on different devices, but the results should look overall similar. In some cases, if possible, you’ll get exactly the same results.
Image-to-5-seconds
Download this image:
Copy this prompt:
The man dances energetically, leaping mid-air with fluid arm swings and quick footwork.
Set like this:
(all default parameters, with teacache turned off)
The result will be:
| Video may be compressed by GitHub |
Important Note:
Again, this is a next-frame-section prediction model. This means you will generate videos frame-by-frame or section-by-section.
If you get a much shorter video in the UI, such as one that is only 1 second long, this is totally expected. You just need to wait. More sections will be generated to complete the video.
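The section-by-section behavior described above can be pictured as a loop that emits a longer preview after every section. This is a toy sketch only: the function name and the value of 30 frames per section are hypothetical, not taken from the model.

```python
from typing import Iterator


def video_length_after_each_section(
    total_seconds: float, fps: int = 30, frames_per_section: int = 30
) -> Iterator[float]:
    """Yield the preview length in seconds after each generated section."""
    total_frames = round(total_seconds * fps)
    done = 0
    while done < total_frames:
        # Each pass adds one section, capped at the requested total.
        done = min(done + frames_per_section, total_frames)
        yield done / fps


for length in video_length_after_each_section(5):
    print(f"preview is now {length:.1f} s long")
```

With these assumed numbers, a 5-second request first shows a 1-second clip and reaches full length only after the fifth section.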
Know the influence of TeaCache and Quantization
Download this image:
Copy this prompt:
The girl dances gracefully, with clear movements, full of charm.
Set like this:
Turn off teacache:
You will get this:
| Video may be compressed by GitHub |
Now turn on teacache:
About 30% of users will get this (the other 70% will get other random results depending on their hardware):
| A typical worse result. |
So you can see that teacache is not really lossless and sometimes can influence the result a lot.
We recommend using teacache to try ideas and then using the full diffusion process to get high-quality results.
This recommendation also applies to sage-attention, bnb quant, gguf, etc., etc.
Image-to-1-minute
The girl dances gracefully, with clear movements, full of charm.
Set video length to 60 seconds:
If everything is in order you will get some result like this eventually.
60s version:
| Video may be compressed by GitHub |
6s version:
| Video may be compressed by GitHub |
More Examples
Many more examples are on the Project Page.
Below are some more examples that you may be interested in reproducing.
The girl dances gracefully, with clear movements, full of charm.
| Video may be compressed by GitHub |
The girl suddenly took out a sign that said “cute” using right hand
| Video may be compressed by GitHub |
The girl skateboarding, repeating the endless spinning and dancing and jumping on a skateboard, with clear movements, full of charm.
| Video may be compressed by GitHub |
The girl dances gracefully, with clear movements, full of charm.
| Video may be compressed by GitHub |
The man dances flamboyantly, swinging his hips and striking bold poses with dramatic flair.
| Video may be compressed by GitHub |
The woman dances elegantly among the blossoms, spinning slowly with flowing sleeves and graceful hand movements.
| Video may be compressed by GitHub |
The young man writes intensely, flipping papers and adjusting his glasses with swift, focused movements.
| Video may be compressed by GitHub |
Prompting Guideline
Many people ask how to write better prompts.
Below is a ChatGPT template that I personally often use to get prompts:
You are an assistant that writes short, motion-focused prompts for animating images.
When the user sends an image, respond with a single, concise prompt describing visual motion (such as human activity, moving objects, or camera movements). Focus only on how the scene could come alive and become dynamic using brief phrases.
Larger and more dynamic motions (like dancing, jumping, running, etc.) are preferred over smaller or more subtle ones (like standing still, sitting, etc.).
Describe subject, then motion, then other things. For example: "The girl dances gracefully, with clear movements, full of charm."
If there is something that can dance (like a man, girl, robot, etc.), then prefer to describe it as dancing.
Stay in a loop: one image in, one motion prompt out. Do not explain, ask questions, or generate multiple options.
Paste these instructions into ChatGPT and then feed it an image to get a prompt like this:
The man dances powerfully, striking sharp poses and gliding smoothly across the reflective floor.
Usually this will give you a prompt that works well.
You can also write prompts yourself. Concise prompts are usually preferred, for example:
The girl dances gracefully, with clear movements, full of charm.
The man dances powerfully, with clear movements, full of energy.
and so on.
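The "subject, then motion, then other things" pattern from the template is easy to apply mechanically when writing many prompts by hand. The helper below is a hypothetical convenience, not part of FramePack.

```python
def build_motion_prompt(subject: str, motion: str, *details: str) -> str:
    """Compose 'subject, then motion, then other things' into one sentence."""
    parts = [f"{subject} {motion}", *details]
    return ", ".join(part.strip() for part in parts) + "."


print(build_motion_prompt("The girl", "dances gracefully", "with clear movements", "full of charm"))
print(build_motion_prompt("The man", "dances powerfully", "with clear movements", "full of energy"))
```

These two calls reproduce the two example prompts given above.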
Cite
@inproceedings{zhang2025framepack,
title={Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models},
author={Lvmin Zhang and Shengqu Cai and Muyang Li and Gordon Wetzstein and Maneesh Agrawala},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
}
@article{zhang2025framepackv1,
title={Packing Input Frame Contexts in Next-Frame Prediction Models for Video Generation},
author={Lvmin Zhang and Maneesh Agrawala},
journal={Arxiv},
year={2025}
}