Count Anything

Hugging Face Daily Papers 05/29/26, 12:00 AM Papers

object-counting text-guided multi-domain computer-vision generalist-model instance-enumeration counting

Summary

Count Anything is a generalist vision model for text-guided object counting across multiple domains, using dual-granularity instance enumeration and complementary counting fusion. It achieves strong accuracy and cross-domain generalization, outperforming existing open-world counting methods.

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.

Original Article


Cached at: 06/01/26, 07:18 AM

Paper page - Count Anything

Source: https://huggingface.co/papers/2605.30846

Abstract

This paper presents Count Anything, a generalist model for text-guided object counting across multiple domains. It uses dual-granularity instance enumeration and Complementary Count Fusion to improve accuracy and cross-domain generalization.

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
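The abstract does not spell out how Complementary Count Fusion works, so the following is only a minimal sketch of the general idea: the count is the cardinality of a fused point set, and the two counters are combined without any learned weights or tuned thresholds. Every name here (`RegionInstance`, `fuse_points`) and the fusion rule itself (keep every region-level anchor, then add only those dense points that fall outside all region boxes) are illustrative assumptions, not the authors' implementation or the API of the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass(frozen=True)
class RegionInstance:
    """Hypothetical output of a region-level counter: an anchor point and its box."""
    anchor: Point
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    def contains(self, p: Point) -> bool:
        """Return True if point p lies inside this instance's box (edges inclusive)."""
        x, y = p
        x_min, y_min, x_max, y_max = self.box
        return x_min <= x <= x_max and y_min <= y <= y_max


def fuse_points(regions: List[RegionInstance], dense_points: List[Point]) -> List[Point]:
    """Assumed fusion rule (not taken from the paper): no weights, no thresholds.

    Region anchors are always kept. A dense point is kept only if no region
    box already covers it, so an object found by both counters is counted once.
    """
    fused = [r.anchor for r in regions]
    for p in dense_points:
        if not any(r.contains(p) for r in regions):
            fused.append(p)
    return fused


if __name__ == "__main__":
    # One large object found by the region-level counter.
    regions = [RegionInstance(anchor=(50.0, 50.0), box=(30.0, 30.0, 70.0, 70.0))]
    # Dense predictions: the first falls inside the region box, the other two do not.
    dense = [(52.0, 48.0), (200.0, 210.0), (205.0, 300.0)]

    points = fuse_points(regions, dense)
    # The count is simply the cardinality of the fused point set.
    print(len(points))  # prints 3
```

Because the output is a set of points rather than a density map, each counted instance can be inspected at its location, which is the interpretability benefit the abstract attributes to the point-based formulation.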


Models citing this paper: 1

#### MengqiLei/count-anything Object Detection • Updated about 6 hours ago • 3



