Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models
Summary
Light Interaction introduces a training-free inference acceleration framework for interactive video world models, using adaptive context management, denoising cache acceleration, and 3D block sparse attention to achieve up to 2.59x speedup while maintaining competitive visual quality.
Cached at: 06/01/26, 03:17 AM
Source: https://huggingface.co/papers/2605.31158
Abstract
Light Interaction accelerates interactive video world models through adaptive computation strategies and optimized attention mechanisms without requiring model retraining.
Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.
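The three trajectory-dependent decisions named in the abstract can be pictured with a toy planner. This is only an illustrative sketch, not the paper's method: the function `plan_chunk`, the `Plan` fields, the distance-based revisit test, the thresholds, and the rule that a larger latent change gets a longer temporal window are all assumptions made for this example, since the abstract does not specify any of them.

```python
import math
from dataclasses import dataclass


@dataclass
class Plan:
    use_spatial_memory: bool        # keep retrieved spatial memory in context?
    temporal_window: int            # number of past chunks kept as temporal context
    reuse_cached_early_steps: bool  # reuse cached early denoising outputs?


def plan_chunk(camera_pos, visited_positions, latent_delta,
               revisit_radius=1.0, min_window=2, max_window=8,
               dynamics_scale=1.0):
    """Hypothetical per-chunk planner; every rule and threshold is an assumption."""
    # Distance to the nearest previously visited camera position
    # (infinite when nothing has been visited yet).
    nearest = min((math.dist(camera_pos, p) for p in visited_positions),
                  default=math.inf)
    revisiting = nearest <= revisit_radius

    # Assumed rule: scale the temporal window linearly with the magnitude
    # of the recent latent change, clamped to [min_window, max_window].
    ratio = min(1.0, max(0.0, latent_delta / dynamics_scale))
    window = min_window + int((max_window - min_window) * ratio)

    return Plan(
        # Novel exploration: retrieved spatial memory is dropped.
        use_spatial_memory=revisiting,
        temporal_window=window,
        # Familiar region: early-step outputs may be reused from a cache.
        reuse_cached_early_steps=revisiting,
    )


novel = plan_chunk((0.0, 0.0, 0.0), [], latent_delta=0.0)
revisit = plan_chunk((0.5, 0.0, 0.0), [(0.0, 0.0, 0.0)], latent_delta=2.0)
```

Here `novel` describes a chunk generated in unexplored space (no spatial memory, shortest window, no cache reuse), while `revisit` describes a chunk near an already visited position. A real system would presumably base these decisions on richer signals than a single Euclidean distance and a scalar latent change.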