A chessboard is a surprisingly good way to catch what VLMs still get wrong
Summary
An informal experiment using a chessboard reveals that vision language models often fail at spatial reasoning and precise structured output, despite correctly recognizing pieces, highlighting a key gap in VLM evaluation.
Similar Articles
Which Pretraining Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models
This paper presents a systematic frozen-feature probing study comparing vision-language models (VLMs) and video generation models (VGMs) on spatial intelligence tasks. It finds that VLMs excel at semantic tagging and instance grouping, while VGMs provide better dense geometry and camera motion signals, and a naive fusion of both yields strong performance across all axes.
Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
The paper introduces SpatialUncertain, a benchmark to evaluate whether vision-language models recognize when they cannot answer spatial questions due to occlusion or perspective ambiguity, revealing overconfidence and poor abstention behavior.
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
Introduces Flat-Pack Bench, a benchmark for evaluating fine-grained spatio-temporal reasoning in large vision-language models using furniture assembly tasks. Experiments show current LVLMs struggle with tracking and spatial interactions.
VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization
This paper introduces a paradigm where Vision-Language Models (VLMs) act as test-time teachers to guide Video Generation Models (VGMs) via differentiable rewards and LoRA optimization, achieving a 16.7-point average improvement on video reasoning benchmarks.
Revealing Interpretable Failure Modes of VLMs
This paper introduces Revelio, a framework that systematically discovers interpretable failure modes in Vision-Language Models (VLMs) by searching over discrete concept combinations. Applied to autonomous driving and indoor robotics, it reveals previously unreported vulnerabilities that lead to crashes or safety hazards.