Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Summary
Introduces Future-L1, an interleaved latent visual reasoning framework that improves video event prediction by maintaining visual semantics in latent space. Achieves state-of-the-art results on FutureBench and TwiFF-Bench benchmarks.
Cached at: 06/05/26, 06:07 AM
Source: https://huggingface.co/papers/2606.05769
Abstract
Future-L1, an interleaved latent visual reasoning framework, improves video event prediction by maintaining visual semantics in latent space during autoregressive decoding, achieving state-of-the-art results on FutureBench and TwiFF-Bench benchmarks.
Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
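The abstract names two training signals without giving their formulas: aligning latent states to future-frame embeddings, and a temporal-diversity reward. The sketch below is only an illustration of what such signals could look like, assuming cosine distance for both. The function names (`alignment_loss`, `temporal_diversity`) and the choice of cosine distance are assumptions for this example, not the paper's definitions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def alignment_loss(latents, future_embeddings):
    """Hypothetical alignment loss: mean cosine distance between each
    latent state and its paired future-frame embedding (0 = aligned)."""
    if len(latents) != len(future_embeddings) or not latents:
        raise ValueError("need equally many latents and embeddings, at least one")
    distances = [1.0 - cosine(z, e) for z, e in zip(latents, future_embeddings)]
    return sum(distances) / len(distances)


def temporal_diversity(latents):
    """Hypothetical diversity score: mean cosine distance between
    consecutive latent states. A span that repeats one state scores 0."""
    if len(latents) < 2:
        return 0.0
    distances = [1.0 - cosine(a, b) for a, b in zip(latents, latents[1:])]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for latent states and frame embeddings.
    latents = [[1.0, 0.0], [0.0, 1.0]]
    targets = [[1.0, 0.0], [0.0, 1.0]]
    print("alignment loss:", alignment_loss(latents, targets))
    print("temporal diversity:", temporal_diversity(latents))
```

In this toy setup a lower alignment loss means the latent span tracks the future frames more closely, while a higher diversity score means consecutive latent states differ rather than collapsing onto one state. How the paper actually weights or combines such terms inside LA-DAPO is not stated in the abstract.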
Similar Articles
VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
VisualThink-VLA introduces a visual intermediate reasoning framework for vision-language-action policies that preserves spatial precision and dramatically reduces latency compared to text-based reasoning, achieving sub-second inference and state-of-the-art success rates on robot manipulation benchmarks.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning
This paper introduces RIS, a framework for spatial-semantic grounded latent visual reasoning in Multimodal Large Language Models to overcome information bottlenecks. It proposes anchoring latent tokens to spatial and semantic evidence, showing improvements on benchmarks like V* and HRBench.
OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.
Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
Re2Pix is a hierarchical video prediction framework that improves future video generation by first predicting semantic representations using frozen vision foundation models, then conditioning a latent diffusion model on these predictions to generate photorealistic frames. The approach addresses train-test mismatches through nested dropout and mixed supervision strategies, achieving improved temporal semantic consistency and perceptual quality on autonomous driving benchmarks.
LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
LLaVA-OneVision-2 introduces codec-stream tokenization and windowed attention for efficient video understanding, achieving state-of-the-art performance across multiple multimodal benchmarks including video, spatial, and tracking tasks.