ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis

Hugging Face Daily Papers 04/21/26, 12:00 AM Papers

Summary

ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.

Cached at: 04/23/26, 07:47 AM

Source: https://huggingface.co/papers/2604.19720

Published on Apr 21

Submitted by taited (https://huggingface.co/taited) on Apr 23

Abstract

A pose- and viewpoint-controllable human video generation method combines image generation with SMPL-X motion guidance and video diffusion models to produce high-quality, temporally consistent videos.

Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
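The two-stage data flow described in the abstract (per-frame image synthesis under pose and viewpoint control, followed by a temporal refinement pass) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in and not the authors' code: `Control`, `toy_image_model`, `generate_frames`, `refine_temporally`, and `temporal_jitter` are invented names, frames are flat lists of floats rather than images, and the refinement step is plain neighbour averaging, not a video diffusion model. It only shows why frames generated independently can flicker and how a later pass that needs no training can reduce that flicker.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image: a flat list of pixel values


@dataclass
class Control:
    """Hypothetical per-frame control (stand-in for SMPL-X pose and camera view)."""
    pose: float
    view: float


def toy_image_model(appearance: Frame, control: Control) -> Frame:
    """Assumed placeholder for a pretrained image backbone.

    Adds a pose-dependent offset plus a simulated flicker on odd poses,
    mimicking appearance inconsistency between independently generated frames.
    """
    flicker = 0.5 if int(control.pose) % 2 else 0.0
    return [px + 0.1 * control.pose + flicker for px in appearance]


def generate_frames(
    appearance: Frame,
    controls: Sequence[Control],
    image_model: Callable[[Frame, Control], Frame],
) -> List[Frame]:
    """Stage 1: synthesize every frame independently from the same appearance."""
    return [image_model(appearance, c) for c in controls]


def refine_temporally(frames: List[Frame], strength: float = 0.5) -> List[Frame]:
    """Stage 2 placeholder: blend each frame with the mean of its neighbours.

    This is simple smoothing, NOT video diffusion. It stands in for a
    refinement pass that requires no additional training.
    """
    refined = []
    last = len(frames) - 1
    for t, frame in enumerate(frames):
        prev_f = frames[max(t - 1, 0)]
        next_f = frames[min(t + 1, last)]
        refined.append([
            (1 - strength) * x + strength * 0.5 * (p + n)
            for x, p, n in zip(frame, prev_f, next_f)
        ])
    return refined


def temporal_jitter(frames: List[Frame]) -> float:
    """Sum of absolute pixel differences between consecutive frames."""
    return sum(
        abs(a - b)
        for f0, f1 in zip(frames, frames[1:])
        for a, b in zip(f0, f1)
    )


controls = [Control(pose=float(i), view=0.0) for i in range(4)]
raw_frames = generate_frames([0.0, 1.0], controls, toy_image_model)
smooth_frames = refine_temporally(raw_frames, strength=0.5)
print(temporal_jitter(raw_frames), temporal_jitter(smooth_frames))
```

In this toy setting the jitter measure drops after the second stage. Note that naive averaging also blurs genuine motion, which is one reason a learned video prior would be preferable to smoothing in a real system.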

Get this paper in your agent:

hf papers read 2604.19720

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Long Video Generation (4 minute read)
