Tag
This paper introduces visually grounded thinking, a method that lets vision-language models interleave natural-language reasoning with explicit grounding of visual evidence through points or boxes. A scalable synthesis pipeline and grounding-aware reinforcement learning improve reasoning accuracy, enabling a 4B model to match or surpass a 27B model on spatial and counting benchmarks.
Count Anything is a generalist vision model for text-guided object counting across multiple domains, using dual-granularity instance enumeration and complementary counting fusion. It achieves strong accuracy and cross-domain generalization, outperforming existing open-world counting methods.
The article argues that, despite modern scientific instruments, all measurements ultimately derive from two ancient techniques, comparison and counting, and illustrates the point with examples such as rulers and sundials.