Count Anything (2 minute read)
Summary
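Count Anything is a generalist model for text-guided object counting across visual domains: given an image and a natural-language query, it returns a set of target points whose cardinality gives the count. It is supported by the new CLOC dataset, which covers six visual domains with about 220K images, 619 categories, and 15M object instances. The authors report strong accuracy and multi-domain generalization, outperforming existing open-world counting methods.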
View Cached Full Text
Cached at: 06/16/26, 12:52 AM
# Count Anything

Source: [https://arxiv.org/abs/2605.30846](https://arxiv.org/abs/2605.30846)

## Title: Count Anything

[View PDF](https://arxiv.org/pdf/2605.30846)

> Abstract: Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: [https://github.com/Mengqi-Lei/count-anything](https://github.com/Mengqi-Lei/count-anything).

## Submission history

From: Mengqi Lei

**[v1]** Fri, 29 May 2026 05:08:31 UTC (41,518 KB)
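
The abstract contrasts point-based counting, where the count is the cardinality of a predicted point set, with density-map counting, where the count is conventionally obtained by summing the map. The sketch below illustrates only that difference in output format. It is not the Count Anything implementation: `CountResult`, `density_count`, and the toy values are hypothetical names and data invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CountResult:
    """Hypothetical container for a point-based counting output."""

    query: str                               # natural-language query, e.g. "sheep"
    points: tuple[tuple[float, float], ...]  # one (x, y) per predicted instance

    @property
    def count(self) -> int:
        # The count is the cardinality of the point set, so every counted
        # instance is tied to a concrete image location.
        return len(self.points)


def density_count(density_map: list[list[float]]) -> float:
    """Density-map style counting: the count is the sum over the map."""
    return sum(sum(row) for row in density_map)


# Toy values, invented for illustration only.
result = CountResult(
    query="sheep",
    points=((12.0, 40.5), (88.2, 41.0), (130.7, 95.3)),
)
toy_density = [[0.0, 0.5, 0.5], [1.0, 0.25, 0.75]]

print(result.count)                # 3
print(density_count(toy_density))  # 3.0
```

In the point-based form each unit of the count can be inspected individually, whereas the density-map total is a real-valued sum with no explicit per-instance location.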
Similar Articles
Count Anything
Count Anything is a generalist vision model for text-guided object counting across multiple domains, using dual-granularity instance enumeration and complementary counting fusion. It achieves strong accuracy and cross-domain generalization, outperforming existing open-world counting methods.
Codex for (almost) everything
OpenAI’s Codex gains Mac app control, tool integration, image generation, memory of user preferences, and the ability to handle ongoing repeatable tasks.
@techNmak: A lightweight VLM that beats the giants at OCR. (1.7B parameters, SOTA on OmniDocBench) dots.ocr is a new multilingual…
dots.ocr is a new lightweight 1.7B parameter multilingual vision-language model that achieves state-of-the-art performance on OmniDocBench, outperforming much larger models (72B+) at document parsing and OCR tasks.
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
LocateAnything proposes Parallel Box Decoding for unified visual grounding and object detection, decoding geometric elements as atomic units to improve throughput and localization accuracy, supported by a large-scale dataset of 138M samples.
MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models
MCBench is a new benchmark for assessing the safety of omnimodal large language models across vision, audio, and text modalities. It includes 1196 scenarios and finds current models struggle with cross-modal safety reasoning.