LocateAnything: Fast and High-Quality Vision-Language Grounding via Parallel Box Decoding

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
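
As a rough illustration of the step-count argument above, the sketch below contrasts a coordinate-per-token serialization with a box-per-step one and includes the standard IoU measure referred to in the evaluation. It is not the authors' implementation: the function names, the four-coordinates-per-box layout, and the (x1, y1, x2, y2) corner convention are assumptions made for this example, and a real tokenization may add separator or label tokens.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) corner format


def tokenwise_sequence(boxes: Sequence[Box]) -> List[float]:
    """Flatten boxes so each coordinate is its own decoding step."""
    return [coord for box in boxes for coord in box]


def atomic_sequence(boxes: Sequence[Box]) -> List[Box]:
    """Keep each box whole so one decoding step emits all four coordinates."""
    return list(boxes)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


boxes = [(10.0, 20.0, 110.0, 220.0), (0.0, 0.0, 50.0, 50.0)]
print(len(tokenwise_sequence(boxes)))  # sequential steps with one token per coordinate
print(len(atomic_sequence(boxes)))     # sequential steps with one unit per box
print(iou((0.0, 0.0, 10.0, 10.0), (5.0, 5.0, 15.0, 15.0)))
```

Under these assumptions the number of sequential steps drops by the number of coordinates per box, which is the source of the parallelism the abstract describes. The accuracy claim cannot be shown by a toy like this and rests on the paper's evaluations.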