LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding is mismatched with the coupled structure of box geometry and, because generation is strictly sequential, creates a practical inference bottleneck. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. These results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.
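
The sequential bottleneck described above can be illustrated with a minimal sketch that only counts output steps. It is not the LocateAnything implementation: the normalized (x1, y1, x2, y2) box format, the bin count NUM_BINS, and all function names are assumptions introduced for illustration, and the sketch does not model how a box is actually emitted in a single step.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2), each coordinate normalized to [0, 1].
Box = Tuple[float, float, float, float]

# Hypothetical quantization resolution; not a value taken from the paper.
NUM_BINS = 1000


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer bin index."""
    clamped = min(max(value, 0.0), 1.0)
    return min(int(clamped * num_bins), num_bins - 1)


def serialize_tokenwise(boxes: List[Box]) -> List[int]:
    """Flatten every box into four 1D coordinate tokens.

    Under token-by-token decoding, each token costs one sequential step.
    """
    return [quantize(c) for box in boxes for c in box]


def serialize_atomic(boxes: List[Box]) -> List[Tuple[int, ...]]:
    """Keep every box as one unit holding its four coupled coordinates.

    Under atomic decoding, each unit costs one step.
    """
    return [tuple(quantize(c) for c in box) for box in boxes]


boxes = [(0.10, 0.20, 0.50, 0.60), (0.25, 0.25, 0.75, 0.90)]

tokenwise_steps = len(serialize_tokenwise(boxes))  # 4 steps per box -> 8
atomic_steps = len(serialize_atomic(boxes))        # 1 step per box  -> 2
```

With two boxes, the token-wise serialization needs eight sequential steps while the atomic one needs two, so under these assumptions the number of output steps for boxes drops by a factor of four. Actual throughput gains depend on the model and decoder, which this sketch does not cover.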