Count Anything

Abstract

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.
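The task formulation above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `CountResult`, `box_to_point`, and `mean_absolute_error` are hypothetical names introduced here. The sketch assumes that box annotations are reduced to their centers (one plausible reading of point-centric supervision; the actual conversion may differ) and uses mean absolute error, a common counting metric that the abstract does not itself name.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass(frozen=True)
class CountResult:
    """Hypothetical container for one text-guided counting prediction."""

    query: str           # natural-language query, e.g. "apple"
    points: List[Point]  # one point per predicted target instance

    @property
    def count(self) -> int:
        # The count is the cardinality of the predicted point set.
        return len(self.points)


def box_to_point(box: Tuple[float, float, float, float]) -> Point:
    """Reduce an (x_min, y_min, x_max, y_max) box to its center point.

    Assumption: box centers stand in for point labels here; this is an
    illustrative choice, not a conversion confirmed by the source.
    """
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def mean_absolute_error(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Two made-up box annotations converted into a point-set result.
    boxes = [(0.0, 0.0, 10.0, 20.0), (30.0, 40.0, 50.0, 60.0)]
    result = CountResult(query="apple", points=[box_to_point(b) for b in boxes])
    print(result.query, result.count, result.points)
```

Returning points rather than a density map keeps each counted instance inspectable: a reader can overlay `result.points` on the image to see what was counted.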