Tag
The paper introduces MilliVid, a method for improving long-range consistency in video generation by using a multi-scale autoencoder to compress frames into hierarchical tokens and then generating them with a coarse-to-fine diffusion model, outperforming baselines on Minecraft videos.