Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Papers with Code Trending Papers

Summary

FramePack is a neural network structure that compresses input frames so that the transformer context length stays fixed regardless of video length. This lets video diffusion models process many frames at a computational cost similar to image diffusion and allows much larger training batch sizes. It also introduces an anti-drifting sampling method that reduces exposure bias.

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
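
To make the fixed-context claim concrete, here is a minimal sketch. It assumes a geometric compression schedule in which the most recent input frame keeps its full token count and each older frame is allotted a fixed fraction of the tokens of the frame after it. The abstract does not specify the schedule, so `tokens_per_frame`, `ratio`, and both function names are illustrative assumptions rather than the authors' implementation, and fractional token counts are an idealization.

```python
def packed_context_length(num_frames, tokens_per_frame=1536.0, ratio=2.0):
    """Total context tokens under an ASSUMED geometric compression schedule.

    The most recent frame keeps tokens_per_frame tokens; each older frame
    keeps 1/ratio of the tokens of the frame that follows it.
    """
    total = 0.0
    tokens = float(tokens_per_frame)
    for _ in range(num_frames):
        total += tokens
        tokens /= ratio  # older frames are compressed more aggressively
    return total


def context_bound(tokens_per_frame=1536.0, ratio=2.0):
    """Limit of the geometric series: the context can never exceed this."""
    return tokens_per_frame * ratio / (ratio - 1.0)


if __name__ == "__main__":
    for n in (1, 4, 16, 64, 256):
        packed = packed_context_length(n)
        unpacked = n * 1536.0  # no compression: context grows linearly
        print(f"{n:4d} frames: packed={packed:8.1f}  unpacked={unpacked:10.1f}")
    print("upper bound:", context_bound())
```

Under these assumptions the packed length approaches a constant as frames are added, while the uncompressed length grows linearly with the number of frames. That bounded context is what would keep the per-step cost close to that of image diffusion.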

Cached at: 05/18/26, 06:41 AM


Source: https://huggingface.co/papers/2504.12626 (arXiv:2504.12626)

Published on Apr 17, 2025

Submitted by YSH (https://huggingface.co/BestWishYsh) on Apr 18, 2025

#3 Paper of the day, 51 upvotes


Abstract

AI-generated summary: FramePack, a neural network for video generation, compresses frames to manage transformer context length and enhances video diffusion models with increased batch sizes and improved frame prediction.

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. The FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
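
The ordering behind the anti-drifting sampling method can be sketched as a scheduling plan. This is only an illustration of "inverted temporal order with early-established endpoints" as worded in the abstract. It assumes the video is split into sections, that the final section is generated first, and that each earlier section is conditioned on the later sections already generated. The function names are hypothetical and no model is invoked.

```python
def inverted_section_order(num_sections):
    """Section indices in generation order: the last section comes first."""
    return list(range(num_sections - 1, -1, -1))


def plan_sampling(num_sections):
    """Build a generation plan under the ASSUMED inverted-order scheme.

    Each step records which section is generated and which already
    generated (later-in-time) sections are available as context.
    """
    plan = []
    generated = []
    for idx in inverted_section_order(num_sections):
        plan.append({"generate": idx, "context": sorted(generated)})
        generated.append(idx)
    return plan


if __name__ == "__main__":
    for step, entry in enumerate(plan_sampling(4)):
        print(f"step {step}: generate section {entry['generate']}, "
              f"context sections {entry['context']}")
```

In this sketch the endpoint is fixed at the first step, so later steps fill in earlier sections toward a known target instead of extending an open-ended sequence, which is the situation in which errors accumulate over iterations.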


Community

BestWishYsh (paper submitter), Apr 18, 2025 (edited Apr 18, 2025)

Code: https://github.com/lllyasviel/FramePack

Page: https://lllyasviel.github.io/frame_pack_gitpage


Get this paper in your agent:

hf papers read 2504.12626

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### URWAIFU/framepack-eichi-f1 • Updated Jul 8, 2025

Datasets citing this paper: 1

#### agreeupon/wrkspace-backup-ttl • Updated Jul 13, 2025 • 250

Spaces citing this paper: 14

📹⚡️ linoyts/FramePack-F1, 📹⚡️ makululinux/FramePack-F1, 📹⚡️ ObiJuanCodenobi/VidGen-Emilio, 🚀 jameschen414/FramePack, 📊 rajux75/FramePack, 🚀 YuliyaAether/FramePack-Demo, 📹⚡️ Dzlll/FramePackF1, 📹⚡️ inoculatemedia/FramePack-F1

Collections including this paper: 5

#### Video Generation Collection Video Generation • 51 items • Updated 19 days ago • 2

#### Video Generation Backbone Models Collection • 4 items • Updated Apr 18, 2025 • 1

#### GenAI Collection • 4 items • Updated May 1, 2025

#### stuff i never have time to read Collection • 13 items • Updated Feb 17


Similar Articles

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Hugging Face Daily Papers

Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.

FRAPPE: Full Input, Residual Output Autoencoding with Projection Pursuit Encoder

Hugging Face Daily Papers

FRAPPE is a novel autoencoding framework that uses a projection pursuit encoder to predict residuals from full input, enabling efficient variable-rate image compression with fast CPU-based encoding. At high compression ratios, FRAPPE-Image achieves higher perceptual quality than AVIF with 47x faster encoding, making real-time 1080p 30fps CPU-only encoding possible.

PEEK: Picking Essential frames via Efficient Knowledge distillation

Hugging Face Daily Papers

Introduces PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a teacher model into a lightweight temporal model, outperforming state-of-the-art methods in video captioning while maintaining computational efficiency.