@ManlingLi_: Planning with the views: Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We int…

X AI KOLs Following 06/18/26, 07:53 PM Papers

vision-language-models camera-control reinforcement-learning view-planning dataset research ai

Summary

Introduces ViewSuite, a benchmark with 6-DoF camera control and ~165K task instances for evaluating whether VLMs can plan camera moves. It finds a planning gap: models can roughly track how a camera action changes the view but cannot compose a multi-step plan toward a target view. RL alone reaches only a 2.5% success rate with Qwen2.5-VL-7B; the proposed View Graph Distillation (RL-Graph-SFT) raises it to 47.8%.

Cached at: 06/18/26, 10:11 PM

Planning with the views:

Can VLMs predict how each camera move changes the view, and plan many such moves ahead?

We introduce ViewSuite with 6-DoF camera control and ~165K task instances, testing three settings: Path-to-View, View-to-Path, and Interactive View Planning.

A sharp Planning Gap emerges:

+ VLMs can roughly “track” how a camera action changes the view

- but cannot “compose” a plan towards a target view at all
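
To make "tracking" concrete, here is a minimal sketch of how a sequence of 6-DoF camera moves (3 translations + 3 rotations) folds into a final camera pose. It is not ViewSuite's action space or code: the function names, the 4x4 camera-to-world convention, the "camera looks down -z" convention, and the step sizes are all assumptions chosen for illustration. It also shows that the same moves in a different order land the camera somewhere else, which is one reason composing a plan is harder than following one.

```python
import math

def identity():
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(dx, dy, dz):
    """Translate along the camera's own x/y/z axes."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = dx, dy, dz
    return m

def rotation(axis, degrees):
    """Rotate about a camera-local axis: 'x' (pitch), 'y' (yaw), 'z' (roll)."""
    c = math.cos(math.radians(degrees))
    s = math.sin(math.radians(degrees))
    i, j = {"x": (1, 2), "y": (2, 0), "z": (0, 1)}[axis]
    m = identity()
    m[i][i], m[i][j], m[j][i], m[j][j] = c, -s, s, c
    return m

def track(actions, pose=None):
    """Fold camera-local moves into a final camera-to-world pose."""
    pose = identity() if pose is None else pose
    for delta in actions:
        # Right-multiplying applies the move in the camera's current frame.
        pose = matmul(pose, delta)
    return pose

def position(pose):
    """Camera position in world coordinates, rounded to hide float noise."""
    return tuple(round(pose[r][3], 6) for r in range(3))

# Hypothetical action vocabulary (assumed, not from the benchmark).
forward = translation(0, 0, -1)   # one unit along the viewing direction (-z)
turn_left = rotation("y", 90)     # yaw 90 degrees to the left

a = position(track([forward, turn_left, forward]))
b = position(track([turn_left, forward, forward]))
print(a, b)  # same three moves, different order, different end position
```

Under these conventions the first sequence ends one unit ahead and one unit to the left of the start, while the second ends two units to the left. Tracking only needs this forward fold; planning has to search over such sequences for one that reaches a given target view.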

We then try to teach VLMs with reinforcement learning.

- RL alone cannot teach VLMs this planning ability: only a 2.5% success rate with Qwen2.5-VL-7B.

+ With View Graph Distillation (our RL-Graph-SFT framework), 2.5% → 47.8%

Below, we answer these questions:

Q1. What are the failure modes?

Q2. How can we make RL work?

Q3. What has the model learned? Can we open up the model to see before/after? Can such spatial priors transfer to other view-related tasks?

Led by @James_KKW. Great to work with @LINJIEFUN, @zhengyuan_yang, @shiqi_chen17, @wzenus, @drfeifei, @jiajunwu_cs, Leonidas Guibas, and Lijuan Wang.

A joint effort with @StanfordAILab, @StanfordSVL, and @MSFTResearch.
