A chessboard is a surprisingly good way to catch what VLMs still get wrong

Reddit r/artificial News

Summary

An informal experiment using a chessboard reveals that vision language models often fail at spatial reasoning and precise structured output, despite correctly recognizing pieces, highlighting a key gap in VLM evaluation.

I spent some time testing what vision language models actually understand versus what they can merely describe. A chessboard turned out to be a great probe because the layout has exactly one correct answer: the FEN string.

The models usually recognize the pieces, then write them onto the wrong squares. So the gap is not really perception; it is spatial reasoning and getting the structured output exactly right.

This made me rethink how we benchmark these things. Accuracy on loose descriptions hides the part that breaks in production. We ran this at VideoDB Labs as part of a wider look at VLM evaluation.

What is a task you have found that exposes the real limits of these models?
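To make the recognition-versus-placement gap concrete, here is a minimal sketch of how a predicted FEN could be scored against the ground truth. It is an illustration only, not the scoring used in the experiment: the function names `expand_placement` and `score_fen` and the metric names are hypothetical, and it assumes the model returns a syntactically valid FEN whose first field is the piece placement.

```python
from collections import Counter

PIECES = "prnbqkPRNBQK"


def expand_placement(fen: str) -> list[str]:
    """Expand the piece-placement field of a FEN string into 64 squares.

    Squares run from rank 8 to rank 1, file a to h; "." marks an empty square.
    """
    ranks = fen.split()[0].split("/")
    if len(ranks) != 8:
        raise ValueError(f"expected 8 ranks, got {len(ranks)}")
    squares: list[str] = []
    for rank in ranks:
        row: list[str] = []
        for ch in rank:
            if ch in "12345678":
                row.extend("." * int(ch))  # a digit means that many empty squares
            elif ch in PIECES:
                row.append(ch)
            else:
                raise ValueError(f"unexpected character {ch!r} in rank {rank!r}")
        if len(row) != 8:
            raise ValueError(f"rank {rank!r} does not describe 8 squares")
        squares.extend(row)
    return squares


def score_fen(predicted: str, truth: str) -> dict:
    """Separate 'saw the right pieces' from 'put them on the right squares'."""
    pred = expand_placement(predicted)
    gold = expand_placement(truth)
    pred_pieces = Counter(p for p in pred if p != ".")
    gold_pieces = Counter(g for g in gold if g != ".")
    total = sum(gold_pieces.values())
    # Multiset overlap: pieces the model reported, regardless of square.
    recognized = sum((pred_pieces & gold_pieces).values())
    # Pieces that are also on the correct square.
    placed = sum(1 for p, g in zip(pred, gold) if g != "." and p == g)
    return {
        "exact_match": pred == gold,
        "piece_recall": recognized / total if total else 1.0,
        "placement_recall": placed / total if total else 1.0,
        "square_accuracy": sum(p == g for p, g in zip(pred, gold)) / 64,
    }


# Made-up example: the starting position, with the prediction swapping
# White's knights and bishops on the back rank.
truth = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
predicted = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RBNQKNBR w KQkq - 0 1"
result = score_fen(predicted, truth)
print(result)
```

In this made-up example every piece is recognized (`piece_recall` is 1.0) but four of them sit on the wrong squares, so `exact_match` is false. A loose description of the board would still look correct, which is the failure pattern described above.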
