Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Hugging Face Daily Papers 06/04/26, 12:00 AM Papers

Summary

Introduces Future-L1, an interleaved latent visual reasoning framework that improves video event prediction by keeping intermediate visual semantics in latent space instead of verbalizing them as text. Achieves state-of-the-art results on FutureBench and TwiFF-Bench.

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.
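The alternation between language tokens and continuous latent visual spans described above can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `<latent>` marker, the fixed span length, the scripted `toy_next_token` and `toy_next_latent` stand-ins, and the cosine form of the alignment loss are all hypothetical choices made for illustration. The abstract only states that latent states are aligned to future-frame embeddings; it does not specify the loss.

```python
import math
from typing import List, Union

Vector = List[float]
Step = Union[str, Vector]  # a text token or a continuous latent state

LATENT_START = "<latent>"  # hypothetical marker that opens a latent span
SPAN_LEN = 2               # hypothetical fixed number of latent steps per span
EOS = "<eos>"


def toy_next_token(history: List[Step]) -> str:
    """Stand-in for the language head: replays a fixed script of tokens."""
    script = ["The", "cup", LATENT_START, "will", "fall", EOS]
    n_text = sum(isinstance(s, str) for s in history)
    return script[n_text]


def toy_next_latent(history: List[Step]) -> Vector:
    """Stand-in for a hidden state fed back as input; depends only on position."""
    t = len(history)
    return [math.sin(t), math.cos(t)]


def decode(max_steps: int = 32) -> List[Step]:
    """Autoregressive loop that interleaves text tokens and latent vectors."""
    history: List[Step] = []
    latent_left = 0
    while len(history) < max_steps:
        if latent_left > 0:
            # Latent mode: append a continuous vector instead of sampling a token.
            history.append(toy_next_latent(history))
            latent_left -= 1
            continue
        tok = toy_next_token(history)
        history.append(tok)
        if tok == EOS:
            break
        if tok == LATENT_START:
            latent_left = SPAN_LEN  # switch to latent mode for the next steps
    return history


def cosine_alignment_loss(latents: List[Vector], targets: List[Vector]) -> float:
    """Mean (1 - cosine similarity) between latent states and target embeddings."""
    total = 0.0
    for z, f in zip(latents, targets):
        dot = sum(a * b for a, b in zip(z, f))
        norm = math.sqrt(sum(a * a for a in z)) * math.sqrt(sum(b * b for b in f))
        total += 1.0 - dot / norm
    return total / len(latents)


if __name__ == "__main__":
    for step in decode():
        print(step)
```

In this toy, the sequence contains two latent vectors between `<latent>` and the remaining text tokens. In a real model the latent vectors would come from the network's hidden states and the targets from a vision encoder applied to future frames; here both are placeholders.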

Original Article

Cached at: 06/05/26, 06:07 AM

Source: https://huggingface.co/papers/2606.05769

Get this paper in your agent:

hf papers read 2606.05769

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
