Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

Papers with Code Trending 04/17/25, 04:02 AM Papers

neural-network video-generation frame-prediction video-diffusion context-length transformer

Summary

FramePack is a neural network structure that compresses input frames so that the transformer context length stays fixed regardless of video length. This lets video diffusion models process many frames at a computational cost similar to image diffusion and allows larger training batch sizes. It also introduces an anti-drifting sampling method to reduce exposure bias.

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
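To make the fixed-context claim concrete, here is a minimal sketch of how a compression schedule can bound the total token count. The abstract does not say how frames are compressed, so the schedule is an assumption made for illustration: each older frame keeps half as many tokens as the next newer one, and frames whose budget reaches zero are dropped. The function name, `tokens_per_frame=1024`, and `ratio=2` are hypothetical.

```python
def packed_context_length(num_frames, tokens_per_frame=1024, ratio=2):
    """Total context tokens under an assumed geometric compression schedule.

    The i-th most recent frame (i = 0 is the newest) keeps
    tokens_per_frame // ratio**i tokens. Frames whose budget falls to
    zero contribute nothing, which is one possible way to treat the tail.
    """
    total = 0
    for i in range(num_frames):
        tokens = tokens_per_frame // (ratio ** i)
        if tokens == 0:
            break  # all older frames would also get zero tokens
        total += tokens
    return total


def unpacked_context_length(num_frames, tokens_per_frame=1024):
    """Context tokens when every frame is kept at full resolution."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    for n in (1, 4, 16, 1000):
        print(n, unpacked_context_length(n), packed_context_length(n))
```

Under these assumptions the packed length never reaches `2 * tokens_per_frame`, however many input frames there are, while the unpacked length grows linearly. That bound is what would keep per-step cost and batch size close to the single-image case.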

Original Article


Cached at: 05/18/26, 06:41 AM


Source: https://huggingface.co/papers/2504.12626 (arXiv:2504.12626)

Published on Apr 17, 2025

Submitted by https://huggingface.co/BestWishYsh (YSH) on Apr 18, 2025

#3 Paper of the day

Authors:

Abstract

FramePack, a neural network for video generation, compresses frames to manage transformer context length and enhances video diffusion models with increased batch sizes and improved frame prediction.

AI-generated summary

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames to make the transformer context length a fixed number regardless of the video length. As a result, we are able to process a large number of frames using video diffusion with a computation bottleneck similar to image diffusion. This also makes the training video batch sizes significantly higher (batch sizes become comparable to image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints to avoid exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be finetuned with FramePack, and their visual quality may be improved because the next-frame prediction supports more balanced diffusion schedulers with less extreme flow shift timesteps.
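The inverted temporal order mentioned in the abstract can be pictured with a small scheduling sketch. It only models the order in which sections would be generated, not the diffusion model itself. The abstract does not state what each step conditions on, so the sketch assumes each section sees every later section already generated; the function names are hypothetical.

```python
def inverted_sampling_order(num_sections):
    """Section indices in generation order: the last section comes first."""
    return list(range(num_sections - 1, -1, -1))


def sampling_plan(num_sections):
    """Pair each section with the later sections assumed to be available.

    Returns a list of (section_index, available_context) tuples, where
    available_context holds the indices generated in earlier steps.
    """
    generated = []
    plan = []
    for idx in inverted_sampling_order(num_sections):
        plan.append((idx, sorted(generated)))
        generated.append(idx)
    return plan


if __name__ == "__main__":
    for section, context in sampling_plan(4):
        print(f"generate section {section} given sections {context}")
```

In this picture the endpoint is fixed in the first step, so every later step generates toward content that already exists. That is the intuition behind "early-established endpoints" limiting error accumulation over iterations.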