video-generation-models

#video-generation-models

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Hugging Face Daily Papers ↗ · 2026-05-27 Cached

This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.

0 favorites 0 likes

video-generation-models

Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Submit Feedback