ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Hugging Face Daily Papers

Summary

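ReImagine takes an image-first approach to controllable, high-quality human video generation: a pretrained image backbone with SMPL-X-based motion guidance models appearance, and a training-free refinement stage based on a pretrained video diffusion model supplies temporal consistency, so the two are decoupled.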

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

Cached at: 04/23/26, 07:47 AM

Source: https://huggingface.co/papers/2604.19720

Published on Apr 21. Submitted by taited (https://huggingface.co/taited) on Apr 23.

Abstract

A pose- and viewpoint-controllable human video generation method combines image generation with SMPL-X motion guidance and video diffusion models to produce high-quality, temporally consistent videos.

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
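
The two-stage structure described in the abstract can be illustrated with a toy sketch. This is not the ReImagine implementation: every function name below is hypothetical, the "image backbone" is a stand-in that samples each frame independently from a pose and camera condition, and a simple temporal moving average stands in for the refinement that the paper performs with a pretrained video diffusion model. The only point it tries to show is the decoupling: appearance is produced per frame, and temporal consistency is handled afterwards without any training.

```python
import numpy as np


def render_pose_condition(pose, camera, size=16):
    """Hypothetical stand-in for rasterizing an SMPL-X pose from one viewpoint.

    `pose` and `camera` are scalars here purely for illustration.
    """
    ys, xs = np.mgrid[0:size, 0:size] / size
    return np.sin(2 * np.pi * (xs + pose)) * np.cos(2 * np.pi * (ys + camera))


def generate_frame(appearance, condition, rng, noise_std=0.2):
    """Stand-in for a pretrained image backbone (assumption, not the real model).

    Each frame is sampled independently, so consecutive frames flicker.
    """
    return appearance * condition + noise_std * rng.standard_normal(condition.shape)


def temporal_refine(frames, window=5):
    """Training-free stand-in for the video-prior refinement stage.

    A moving average over time replaces the pretrained video diffusion model.
    """
    pad = window // 2
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[t:t + window].mean(axis=0) for t in range(frames.shape[0])])


def flicker(frames):
    """Mean absolute difference between consecutive frames."""
    return float(np.abs(np.diff(frames, axis=0)).mean())


def generate_video(appearance, poses, cameras, seed=0):
    """Stage 1: per-frame image synthesis. Stage 2: temporal refinement."""
    rng = np.random.default_rng(seed)
    raw = np.stack([
        generate_frame(appearance, render_pose_condition(p, c), rng)
        for p, c in zip(poses, cameras)
    ])
    return raw, temporal_refine(raw)


# Usage: a slowly changing pose and viewpoint over 24 frames.
poses = np.linspace(0.0, 0.25, 24)
cameras = np.linspace(0.0, 0.1, 24)
raw, refined = generate_video(np.full((16, 16), 0.8), poses, cameras)
print(raw.shape, flicker(raw), flicker(refined))
```

In this toy setting the refined sequence should show less frame-to-frame variation than the independently sampled frames, which is the role the abstract assigns to the refinement stage.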

Similar Articles

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

Hugging Face Daily Papers

Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Hugging Face Daily Papers

MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that uses vision-language reasoning to refine trajectories and a confidence-aware control scheme to improve plausibility, outperforming existing approaches on a new benchmark.

Long Video Generation (4 minute read)

TLDR AI

The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.