@ManlingLi_: Planning with the views: Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We int…


Summary

Introduces ViewSuite, a benchmark with 6 DoF camera control and ~165K task instances for evaluating whether VLMs can plan camera moves. It finds a planning gap: models can roughly track how a camera action changes the view but cannot compose a multi-step plan toward a target view. RL alone reaches only a 2.5% success rate with Qwen2.5-VL-7B; the proposed View Graph Distillation (RL-Graph-SFT) raises it to 47.8%.


Cached at: 06/18/26, 10:11 PM

Planning with the views:

Can VLMs predict how each camera move changes the view, and plan many such moves ahead?

We introduce ViewSuite with 6 DoF camera control and ~165K task instances, testing:

  • Path-to-View
  • View-to-Path
  • Interactive View Planning
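To make "6 DoF camera control" concrete, here is a minimal sketch of what following a path of camera moves to a final view could look like. Everything in it is an illustrative assumption rather than ViewSuite's actual action space or API: the helper names, the 4x4 camera-to-world pose matrix, the axis convention (+z forward, y up), and the choice to apply each move in the camera's own frame.

```python
import math

def identity():
    """4x4 identity: the starting camera pose (assumed camera-to-world)."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translate(dx, dy, dz):
    """The three translational degrees of freedom."""
    m = identity()
    m[0][3], m[1][3], m[2][3] = dx, dy, dz
    return m

def rotate(axis, degrees):
    """The three rotational degrees of freedom.

    Assumed naming: 'x' is pitch, 'y' is yaw, 'z' is roll.
    """
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    i, j = {"x": (1, 2), "y": (2, 0), "z": (0, 1)}[axis]
    m = identity()
    m[i][i], m[i][j], m[j][i], m[j][j] = c, -s, s, c
    return m

def apply_moves(pose, moves):
    """Apply each move in the camera's local frame (right-multiplication)."""
    for move in moves:
        pose = matmul(pose, move)
    return pose

def position(pose):
    """Camera position in world coordinates, rounded to hide float noise."""
    return tuple(round(pose[i][3], 6) for i in range(3))

forward = translate(0.0, 0.0, 1.0)  # assumed convention: +z is "forward"
turn = rotate("y", 90)              # a 90-degree yaw

print(position(apply_moves(identity(), [forward, turn])))  # (0.0, 0.0, 1.0)
print(position(apply_moves(identity(), [turn, forward])))  # (1.0, 0.0, 0.0)
```

The same two moves in a different order leave the camera in different places, so under these assumptions a path is an ordered sequence rather than a bag of moves.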

A sharp Planning Gap emerges:

  • VLMs can roughly “track” how a camera action changes the view
  • but they cannot “compose” a plan towards a target view at all

We then try to teach VLMs with reinforcement learning (RL):

  • RL alone cannot teach VLMs this planning ability: only a 2.5% success rate with Qwen2.5-VL-7B.
  • With View Graph Distillation (our RL-Graph-SFT framework), the success rate goes from 2.5% to 47.8%.

Below, we answer these questions:

  • Q1. What are the failure modes?
  • Q2. How can we make RL work?
  • Q3. What has the model learned? Can we open up the model to see before/after? Can such spatial priors transfer to other view-related tasks?

Led by @James_KKW. Great to work with @LINJIEFUN @zhengyuan_yang @shiqi_chen17 @wzenus @drfeifei @jiajunwu_cs, Leonidas Guibas, and Lijuan Wang.

A joint effort with @StanfordAILab @StanfordSVL @MSFTResearch.

Similar Articles

A better method for planning complex visual tasks

MIT News — Artificial Intelligence

MIT researchers developed VLMFP, a two-stage generative AI approach that combines vision-language models with formal planning software to achieve a 70% success rate on complex visual planning tasks such as robot navigation, nearly 2.3x better than existing baselines. The method automatically translates visual scenarios into planning files that classical solvers can process, enabling effective long-horizon planning in novel environments.

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

Hugging Face Daily Papers

SpatialAct is a new simulator-grounded benchmark that probes whether VLM agents can perform coherent spatial reasoning and translate it into actions in 3D environments across multi-turn feedback settings. Experiments reveal a significant reasoning-to-action gap, with current VLMs struggling to maintain spatial beliefs and produce reliable actions despite performing well on isolated reasoning tasks.

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

Hugging Face Daily Papers

OneVL is a unified vision-language-action framework that compresses chain-of-thought reasoning into latent tokens supervised by both language and visual world model decoders, achieving state-of-the-art trajectory prediction accuracy for autonomous driving at answer-only inference latency. It is the first latent CoT method to surpass explicit CoT across four benchmarks.