Count Anything
May 29, 2026
Authors: Mengqi Lei, Shuokun Cheng, Wei Bao, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao
cs.AI
Abstract
Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
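The point-set formulation can be illustrated with a small sketch: the prediction is a list of instance points, and the count is simply the length of that list. The abstract does not specify how Complementary Count Fusion merges the two counters, so the rule below is only an assumption chosen for illustration: keep one point per region-level instance, then add any dense point that no region box already covers. The names `RegionInstance` and `fuse_counts` are hypothetical and are not taken from the Count Anything codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class RegionInstance:
    """Hypothetical region-level prediction: an axis-aligned box for one object."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

    @property
    def center(self) -> Point:
        x0, y0, x1, y1 = self.box
        return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

    def contains(self, p: Point) -> bool:
        x0, y0, x1, y1 = self.box
        return x0 <= p[0] <= x1 and y0 <= p[1] <= y1


def fuse_counts(regions: List[RegionInstance], dense_points: List[Point]) -> List[Point]:
    """Assumed fusion rule (not the paper's): no learned weights or thresholds.

    Each region contributes its center point; a dense point is kept only if
    it lies outside every region box, so the same object is not counted twice.
    """
    fused = [r.center for r in regions]
    for p in dense_points:
        if not any(r.contains(p) for r in regions):
            fused.append(p)
    return fused


# Two large objects found by the region-level counter, four dense points,
# two of which fall inside the region boxes and are therefore dropped.
regions = [RegionInstance((0, 0, 10, 10)), RegionInstance((20, 20, 40, 40))]
dense_points = [(5.0, 5.0), (30.0, 31.0), (50.0, 50.0), (52.0, 48.0)]

points = fuse_counts(regions, dense_points)
count = len(points)  # the cardinality of the point set is the count
print(count, points)
```

Because every counted object is tied to a point, a wrong count can be traced back to the specific locations that caused it, which is the interpretability benefit the abstract attributes to discrete instance points over density maps.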