LocateAnything: Fast and High-Quality Vision-Language Grounding via Parallel Box Decoding

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem: each 2D box is serialized into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding is mismatched with the coupled structure of box geometry, and its strictly sequential generation creates a practical inference bottleneck. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. These results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
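To make the contrast in the abstract concrete, the following minimal Python sketch counts decoding steps under the two formulations and computes the IoU metric behind "high-IoU localization quality". It is an illustration under stated assumptions, not the LocateAnything implementation: the four-token (x1, y1, x2, y2) serialization, the 1000-bin quantization, and the function names `serialize_box`, `sequential_steps`, and `atomic_steps` are all hypothetical, and the one-step-per-box count is only one reading of "atomic units in a single step".

```python
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]

# Assumed quantization resolution; the abstract does not specify one.
NUM_BINS = 1000


def serialize_box(box: Box, width: int, height: int) -> List[int]:
    """Quantize one 2D box into four 1D coordinate tokens (assumed scheme)."""
    x1, y1, x2, y2 = box
    scale = NUM_BINS - 1
    return [
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    ]


def sequential_steps(num_boxes: int, tokens_per_box: int = 4) -> int:
    """Decoding steps when every coordinate token is generated one at a time."""
    return num_boxes * tokens_per_box


def atomic_steps(num_boxes: int) -> int:
    """Decoding steps if each box is emitted as one atomic unit (assumption)."""
    return num_boxes


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(serialize_box((64.0, 48.0, 320.0, 240.0), width=640, height=480))
    print(sequential_steps(10), atomic_steps(10))
    print(iou((0.0, 0.0, 10.0, 10.0), (5.0, 0.0, 15.0, 10.0)))
```

Under these assumptions, ten boxes take 40 sequential steps with token-by-token decoding and 10 steps with one step per box. The step count is only a proxy for throughput, since the real speedup depends on model and hardware details the abstract does not give.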