Count Anything

Abstract

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to a single scenario, such as crowds, vehicles, cells, crops, or remote-sensing objects, and therefore struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains: a model takes an image and a natural-language query as input and returns a set of target points, each grounded in one instance, whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains (General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology) and contains about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike the density-map-based methods that dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines the two counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
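To make the task formulation concrete, the following minimal Python sketch shows what a point-set output could look like and why the count falls out as the size of that set. The names `CountingResult` and `count_in_box` are hypothetical and chosen only for illustration; they are not taken from the Count Anything code base, and the sketch says nothing about how the model itself produces the points.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]              # (x, y) in pixel coordinates
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class CountingResult:
    """Hypothetical container for one text-guided counting prediction."""
    query: str            # natural-language query, e.g. "apples"
    points: List[Point]   # one predicted point per target instance

    @property
    def count(self) -> int:
        # The count is the cardinality of the predicted point set.
        return len(self.points)


def count_in_box(result: CountingResult, box: Box) -> int:
    """Count predicted points inside an axis-aligned box (inclusive bounds).

    Because every counted instance has a location, region-level counts
    can be read directly from the same prediction.
    """
    x_min, y_min, x_max, y_max = box
    return sum(
        1
        for x, y in result.points
        if x_min <= x <= x_max and y_min <= y <= y_max
    )


# Made-up example values, for illustration only.
example = CountingResult(
    query="apples",
    points=[(10.0, 12.5), (40.0, 8.0), (55.5, 60.0)],
)
```

In this sketch, `example.count` is the total number of predicted instances, and `count_in_box(example, (0, 0, 50, 50))` restricts the same prediction to one image region, which is the kind of inspection a density map does not offer per instance.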