GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
March 13, 2025
Authors: Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Pixel grounding, encompassing tasks such as Referring Expression Segmentation
(RES), has garnered considerable attention due to its immense potential for
bridging the gap between vision and language modalities. However, advancements
in this domain are currently constrained by limitations inherent in existing
datasets, including limited object categories, insufficient textual diversity,
and a scarcity of high-quality annotations. To mitigate these limitations, we
introduce GroundingSuite, which comprises: (1) an automated data annotation
framework leveraging multiple Vision-Language Model (VLM) agents; (2) a
large-scale training dataset encompassing 9.56 million diverse referring
expressions and their corresponding segmentations; and (3) a meticulously
curated evaluation benchmark consisting of 3,800 images. The GroundingSuite
training dataset facilitates substantial performance improvements, enabling
models trained on it to achieve state-of-the-art results: specifically, a cIoU
of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the
GroundingSuite annotation framework demonstrates superior efficiency compared
to GLaMM, the current leading data annotation method, running 4.5 times
faster.
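
For context, the headline numbers above use the two mask-quality metrics standard in referring-segmentation benchmarks. The sketch below shows how cIoU (cumulative intersection over cumulative union across a dataset) and gIoU (mean per-image IoU) are conventionally computed in the RES literature; the function name, NumPy usage, and the empty-mask convention are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ciou_giou(pred_masks, gt_masks):
    """Compute cIoU and gIoU over paired lists of binary masks.

    Conventions assumed here (standard in RES evaluation, not specified
    in this abstract):
      - cIoU: cumulative intersection / cumulative union over all samples
      - gIoU: mean of per-sample IoUs
    """
    inter_total, union_total, per_image = 0, 0, []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inter_total += inter
        union_total += union
        # Assumed convention: an empty ground truth matched by an empty
        # prediction counts as a perfect IoU of 1.0.
        per_image.append(inter / union if union > 0 else 1.0)
    ciou = inter_total / union_total if union_total > 0 else 1.0
    giou = float(np.mean(per_image))
    return ciou, giou
```

Because cIoU pools pixels before dividing, it weights large objects more heavily, whereas gIoU treats every image equally; this is why benchmarks such as gRefCOCO and RefCOCOm report them separately.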