Relit-LiVE: Relight Video by Jointly Learning Environment Video
Summary
This paper introduces Relit-LiVE, a novel video relighting framework that produces physically consistent results without requiring camera pose information by using raw reference images and joint environment video prediction.
Source: https://huggingface.co/papers/2605.06658
Abstract
Recent advances have shown that large-scale video diffusion models can be repurposed as neural renderers by first decomposing videos into intrinsic scene representations and then performing forward rendering under novel illumination. While promising, this paradigm fundamentally relies on accurate intrinsic decomposition, which remains highly unreliable for real-world videos and often leads to distorted appearances, broken materials, and accumulated temporal artifacts during relighting. In this work, we present Relit-LiVE, a novel video relighting framework that produces physically consistent, temporally stable results without requiring prior knowledge of camera pose. Our key insight is to explicitly introduce raw reference images into the rendering process, enabling the model to recover critical scene cues that are inevitably lost or corrupted in intrinsic representations. Furthermore, we propose a novel environment video prediction formulation that simultaneously generates relit videos and per-frame environment maps aligned with each camera viewpoint in a single diffusion process. This joint prediction enforces strong geometric-illumination alignment and naturally supports dynamic lighting and camera motion, significantly improving physical consistency in video relighting while easing the requirement of known per-frame camera pose. Extensive experiments demonstrate that Relit-LiVE consistently outperforms state-of-the-art video relighting and neural rendering methods across synthetic and real-world benchmarks. Beyond relighting, our framework naturally supports a wide range of downstream applications, including scene-level rendering, material editing, object insertion, and streaming video relighting. The project is available at https://github.com/zhuxing0/Relit-LiVE.
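The abstract only sketches the core mechanism at a high level. As a rough illustration of the joint prediction idea, the snippet below shows how a single diffusion backbone might denoise relit-video latents and per-frame environment-map latents together while conditioning on raw reference images. All names, the function signature, and the latent layout here are assumptions for illustration, not the authors' published interface.

```python
# Conceptual sketch only: all names (joint_relight_step, the `denoiser`
# interface, the channel-wise latent layout) are hypothetical placeholders
# illustrating joint relit-video / environment-video denoising.
import torch

def joint_relight_step(denoiser, noisy_relit, noisy_env, ref_images, t):
    """One denoising step that predicts relit-video latents and per-frame
    environment-map latents jointly, conditioned on raw reference images."""
    # Stack the two latent streams along the channel axis so one diffusion
    # backbone denoises them in a single pass (assumed layout).
    joint = torch.cat([noisy_relit, noisy_env], dim=1)
    # Raw reference images act as conditioning, letting the model recover
    # scene cues that intrinsic decompositions tend to lose or corrupt.
    eps = denoiser(joint, timestep=t, reference=ref_images)
    # Split the prediction back into the relit and environment streams.
    eps_relit, eps_env = eps.chunk(2, dim=1)
    return eps_relit, eps_env
```

Denoising both streams with one backbone is what would keep the predicted environment maps aligned with each camera viewpoint frame by frame, which is the geometric-illumination consistency argument the abstract makes.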
Get this paper in your agent: hf papers read 2605.06658
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### weiqingXiao/Relit-LiVE (Image-to-Video) • Updated 5 days ago • 2
Similar Articles
WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting
This paper introduces WildRelight, a new real-world benchmark dataset for single-image relighting that addresses the gap between synthetic and natural scenes. It proposes a physics-guided adaptation framework using diffusion posterior sampling and test-time adaptation to improve model performance on real-world data.
ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
ReImagine introduces an image-first approach to controllable high-quality human video generation, combining SMPL-X motion guidance with video diffusion models to decouple appearance from temporal consistency.
CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models
CollabVR proposes a closed-loop framework that collaboratively integrates vision-language models with video generation models to improve visual reasoning and correct failures in real time.
Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction
Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.
EasyVideoR1: Easier RL for Video Understanding
EasyVideoR1 is an efficient reinforcement learning framework for training large vision-language models on video understanding tasks, featuring offline preprocessing with tensor caching for 1.47x throughput improvement, a task-aware reward system covering 11 problem types, and evaluation across 22 video benchmarks. It also supports joint image-video training and a mixed offline-online data training paradigm.