Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
Summary
FramePack is a neural network structure that compresses input frames to fix transformer context length regardless of video length, enabling video diffusion models to process many frames with computation similar to image diffusion and improving batch sizes. It also introduces an anti-drifting sampling method to reduce exposure bias.
Source: https://huggingface.co/papers/2504.12626 (arXiv:2504.12626)
Published on Apr 17, 2025
Submitted by YSH (https://huggingface.co/BestWishYsh) on Apr 18, 2025
#3 Paper of the day
Abstract
FramePack, a neural network for video generation, compresses frames to manage transformer context length and enhances video diffusion models with increased batch sizes and improved frame prediction.
AI-generated summary
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
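To make the fixed-context claim concrete, here is a minimal sketch in Python. It rests on assumptions that this page does not state: that each older frame keeps a geometrically smaller share of tokens than the next newer one, and that sections are simply visited last to first. The names `packed_context_length`, `context_bound`, `inverted_section_order`, and the parameters `tokens_per_frame` and `rate` are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
def packed_context_length(num_frames: int,
                          tokens_per_frame: float = 1536.0,
                          rate: float = 2.0) -> float:
    """Total context tokens under an ASSUMED geometric compression schedule.

    The newest frame keeps `tokens_per_frame` tokens; each older frame keeps
    1/rate of the tokens of the frame after it. The schedule is illustrative.
    """
    total = 0.0
    tokens = tokens_per_frame
    for _ in range(num_frames):
        total += tokens
        tokens /= rate  # older frames are compressed more aggressively
    return total


def context_bound(tokens_per_frame: float = 1536.0, rate: float = 2.0) -> float:
    """Limit of the geometric series above as the number of frames grows."""
    if rate <= 1.0:
        raise ValueError("rate must exceed 1 for the context to stay bounded")
    return tokens_per_frame * rate / (rate - 1.0)


def inverted_section_order(num_sections: int) -> list[int]:
    """Hypothetical generation order: last section first, first section last."""
    return list(range(num_sections - 1, -1, -1))


if __name__ == "__main__":
    for n in (1, 4, 16, 1000):
        print(n, packed_context_length(n), "uncompressed:", n * 1536.0)
    print("bound:", context_bound())
    print("order:", inverted_section_order(4))
```

Under these assumed numbers the packed context never exceeds twice the cost of a single frame, however many frames are supplied, while the uncompressed context grows linearly. That is the sense in which the computation bottleneck can resemble image diffusion.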
Community
Paper submitter:
Code: https://github.com/lllyasviel/FramePack
Page: https://lllyasviel.github.io/frame_pack_gitpage
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate (2025)
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction (2025)
- HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models (2025)
- AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion (2025)
- Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks (2025)
- LongDiff: Training-Free Long Video Generation in One Go (2025)
- Long Context Tuning for Video Generation (2025)
Get this paper in your agent:
hf papers read 2504.12626
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### URWAIFU/framepack-eichi-f1 • Updated Jul 8, 2025
Datasets citing this paper: 1
#### agreeupon/wrkspace-backup-ttl • Updated Jul 13, 2025 • 250
Spaces citing this paper: 14
- 📹⚡️ linoyts/FramePack-F1
- 📹⚡️ makululinux/FramePack-F1
- 📹⚡️ ObiJuanCodenobi/VidGen-Emilio
- 🚀 jameschen414/FramePack
- 📊 rajux75/FramePack
- 🚀 YuliyaAether/FramePack-Demo
- 📹⚡️ Dzlll/FramePackF1
- 📹⚡️ inoculatemedia/FramePack-F1
Collections including this paper: 5
#### Video Generation Collection Video Generation • 51 items • Updated 19 days ago • 2
#### Video Generation Backbone Models Collection 4 items • Updated Apr 18, 2025 • 1
#### GenAI Collection 4 items • Updated May 1, 2025
#### stuff i never have time to read Collection 13 items • Updated Feb 17
Similar Articles
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
LiteFrame: Efficient Vision Encoders Unlock Frame Scaling in Video LLMs
LiteFrame proposes a lightweight video encoder with Compressed Token Distillation training that reduces latency and enables processing 8x more frames for long-form video understanding in Video LLMs, improving accuracy while reducing compute.
FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder
FRAPPE is a novel autoencoding framework that uses a projection pursuit encoder to predict residuals from full input, enabling efficient variable-rate image compression with fast CPU-based encoding. At high compression ratios, FRAPPE-Image achieves higher perceptual quality than AVIF with 47x faster encoding, making real-time 1080p 30fps CPU-only encoding possible.
PEEK: Picking Essential frames via Efficient Knowledge distillation
Introduces PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while maintaining computational efficiency.
SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
SANA-Video is a small diffusion model that efficiently generates high-resolution, long videos using linear attention and a constant-memory KV cache, achieving competitive performance at dramatically lower cost and faster speed compared to existing models.