Count Anything

May 29, 2026
Authors: Mengqi Lei, Shuokun Cheng, Wei Bao, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao
cs.AI

Abstract

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
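
To make the output formulation concrete, the following minimal Python sketch contrasts two ways a count can be read out: summing a density map, as density-map-based methods do, and taking the cardinality of a predicted point set, as in the formulation described in the abstract. The names `PointPrediction`, `density_count`, and `points_to_density` are hypothetical and chosen only for illustration; they are not taken from the Count Anything codebase, and the sketch does not attempt to reproduce the Region-level Sparse Counter, the Pixel-level Dense Counter, or Complementary Count Fusion.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PointPrediction:
    """Hypothetical container for a point-based counting output."""

    query: str          # natural-language query, e.g. "apple"
    points: np.ndarray  # shape (N, 2): one (x, y) row per predicted instance

    @property
    def count(self) -> int:
        # The count is the cardinality of the point set.
        return int(self.points.shape[0])


def density_count(density_map: np.ndarray) -> float:
    """Density-map readout: the count is the sum over the map."""
    return float(density_map.sum())


def points_to_density(points: np.ndarray, height: int, width: int) -> np.ndarray:
    """Place unit mass at each point (an unsmoothed density map).

    Assumes every point lies inside the image bounds.
    """
    density = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        density[int(y), int(x)] += 1.0
    return density


if __name__ == "__main__":
    pred = PointPrediction(
        query="apple",
        points=np.array([[2.0, 3.0], [10.0, 4.0], [7.0, 7.0]]),
    )
    density = points_to_density(pred.points, height=16, width=16)
    print(pred.count, density_count(density))
```

Both readouts agree on the total here, but only the point set says where each counted instance is, which is the sense in which the point-based formulation pairs counting with spatial localization.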