GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Abstract

Pixel grounding, which encompasses tasks such as Referring Expression Segmentation (RES), has attracted considerable attention for its potential to bridge the gap between the vision and language modalities. However, progress in this area is constrained by the limitations of existing datasets: limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework that leverages multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset of 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark of 3,800 images. The GroundingSuite training dataset yields substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is more efficient than the current leading data annotation method: it is 4.5 times faster than GLaMM.
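The abstract reports results in cIoU and gIoU without defining them. The sketch below illustrates how these two metrics are commonly computed in the referring segmentation literature. It assumes that cIoU is the intersection accumulated over all samples divided by the accumulated union, and that gIoU is the mean of the per-sample IoU values. The function names, the list-of-rows mask format, and the choice to score an empty prediction against an empty target as 1.0 are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

# Assumed mask format: a binary mask given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def ciou(preds: List[Mask], targets: List[Mask]) -> float:
    """Cumulative IoU: total intersection over total union across samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: if every mask is empty, treat the result as a perfect score.
    return total_inter / total_union if total_union else 1.0


def giou(preds: List[Mask], targets: List[Mask]) -> float:
    """Mean of per-sample IoU values (not the box-regression 'Generalized IoU')."""
    if not preds:
        raise ValueError("at least one sample is required")
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: empty prediction vs. empty target counts as correct.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Two toy 2x2 samples: the first over-segments, the second is exact.
example_preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
example_targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
example_ciou = ciou(example_preds, example_targets)  # 2 / 3
example_giou = giou(example_preds, example_targets)  # (0.5 + 1.0) / 2
```

Because cIoU pools pixels before dividing, large objects dominate it, whereas the per-sample average weights every sample equally. This is one reason the two numbers can differ noticeably on the same predictions.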
