ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Summary
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
Cached at: 04/23/26, 07:47 AM
Source: https://huggingface.co/papers/2604.19720
Published on Apr 21
Submitted by taited (https://huggingface.co/taited) on Apr 23
Abstract
A pose- and viewpoint-controllable human video generation method combines image generation with SMPL-X motion guidance and video diffusion models to produce high-quality, temporally consistent videos.
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
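The decoupling described in the abstract (per-frame appearance synthesis first, temporal consistency handled in a separate pass) can be illustrated with a toy sketch. This is not the authors' implementation. Every name below (`generate_frame`, `temporal_refine`, `flicker`, `run_pipeline`) is hypothetical, the "image generator" is just an array plus noise, and a moving average over time stands in for the video-diffusion-based refinement, which the abstract does not specify in detail.

```python
import numpy as np


def generate_frame(appearance, pose_offset, rng):
    """Hypothetical stand-in for a pose-conditioned image generator."""
    # Each frame is produced independently, so its noise differs from
    # its neighbours; this is what shows up as frame-to-frame flicker.
    return appearance + pose_offset + 0.1 * rng.standard_normal(appearance.shape)


def temporal_refine(frames, window=3):
    """Toy stand-in for temporal refinement: moving average over time.

    Assumes an odd window. A real system would use a pretrained video
    model here; the averaging only illustrates a separate temporal pass.
    """
    pad = window // 2
    # Repeat the first and last frames so the output keeps the same length.
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack(
        [padded[t:t + window].mean(axis=0) for t in range(frames.shape[0])]
    )


def flicker(frames):
    """Mean absolute difference between consecutive frames."""
    return float(np.abs(np.diff(frames, axis=0)).mean())


def run_pipeline(num_frames=8, size=16, seed=0):
    rng = np.random.default_rng(seed)
    appearance = rng.random((size, size))        # fixed "identity" image
    pose_offsets = 0.01 * np.arange(num_frames)  # slow, smooth "motion"
    # Stage 1: image-first synthesis, one frame at a time.
    frames = np.stack([generate_frame(appearance, p, rng) for p in pose_offsets])
    # Stage 2: temporal pass applied afterwards, with no training involved.
    return frames, temporal_refine(frames)


if __name__ == "__main__":
    raw, refined = run_pipeline()
    print("flicker before:", flicker(raw))
    print("flicker after: ", flicker(refined))
```

In this toy setting the second stage lowers the flicker measure without touching the first stage, which is the structural point of an image-first design: the appearance model and the temporal pass can be swapped independently.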
Similar Articles
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
CoInteract introduces an end-to-end Diffusion Transformer framework that jointly models RGB appearance and HOI geometry to generate physically-plausible human-object interaction videos with stable hands/faces and zero inference overhead.
VideoMDM: Towards 3D Human Motion Generation From 2D Supervision
VideoMDM trains 3D human motion priors from 2D poses using a diffusion framework with 2D reprojection loss and 3D motion regularizers, achieving near-3D supervised performance without requiring 3D ground truth.
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
MotiMotion introduces a reasoning-then-generation framework for motion-controlled video generation that uses vision-language reasoning to refine trajectories and a confidence-aware control scheme to improve plausibility, outperforming existing approaches on a new benchmark.
Long Video Generation
The article introduces A²RD, a novel architecture for generating consistent long videos using agentic autoregressive diffusion. It proposes a Retrieve–Synthesize–Refine–Update cycle and a new benchmark, LVBench-C, to address semantic drift in long-horizon video synthesis.