GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding
March 13, 2025
Authors: Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Pixel grounding, encompassing tasks such as Referring Expression Segmentation
(RES), has garnered considerable attention due to its immense potential for
bridging the gap between vision and language modalities. However, advancements
in this domain are currently constrained by limitations inherent in existing
datasets, including limited object categories, insufficient textual diversity,
and a scarcity of high-quality annotations. To mitigate these limitations, we
introduce GroundingSuite, which comprises: (1) an automated data annotation
framework leveraging multiple Vision-Language Model (VLM) agents; (2) a
large-scale training dataset encompassing 9.56 million diverse referring
expressions and their corresponding segmentations; and (3) a meticulously
curated evaluation benchmark consisting of 3,800 images. The GroundingSuite
training dataset yields substantial performance improvements: models trained
on it achieve state-of-the-art results, with a cIoU of 68.9 on gRefCOCO and a
gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is
markedly more efficient than the current leading data annotation method,
GLaMM, running 4.5 times faster.
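
The abstract only names the framework's high-level design (multiple cooperating VLM agents), so the following is a hypothetical sketch of what such a propose-ground-verify annotation loop could look like. Every interface and function name below is an illustrative assumption, not the paper's actual implementation:

```python
"""Hypothetical sketch of a multi-VLM-agent annotation loop. The abstract does
not detail GroundingSuite's framework, so the agent interfaces below are
illustrative assumptions, not the paper's actual design."""
from typing import Protocol

import numpy as np


class Proposer(Protocol):
    # VLM agent that writes referring expressions for objects in the image
    def propose(self, image: np.ndarray) -> list[str]: ...


class Segmenter(Protocol):
    # grounding model that maps a referring expression to a binary mask
    def segment(self, image: np.ndarray, text: str) -> np.ndarray: ...


class Verifier(Protocol):
    # second VLM agent that checks whether the text and mask actually agree
    def accept(self, image: np.ndarray, text: str, mask: np.ndarray) -> bool: ...


def annotate(image: np.ndarray, proposer: Proposer, segmenter: Segmenter,
             verifier: Verifier) -> list[tuple[str, np.ndarray]]:
    """Propose -> ground -> verify; keep only pairs the verifier accepts."""
    kept = []
    for text in proposer.propose(image):
        mask = segmenter.segment(image, text)
        if verifier.accept(image, text, mask):
            kept.append((text, mask))
    return kept
```

In this sketch the verifier acts as the automated quality filter that would otherwise require human review, which is the general pattern that lets agent-based pipelines scale to millions of expression-mask pairs.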
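For readers unfamiliar with the reported metrics: cIoU (cumulative IoU) divides the total intersection pixel count by the total union pixel count over the whole evaluation set, while gIoU (generalized IoU, as used by gRefCOCO) averages per-sample IoU, conventionally scoring a no-target sample as 1.0 when the prediction is correctly empty. A minimal sketch, assuming lists of binary NumPy masks and the conventions just described:

```python
import numpy as np


def ciou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> float:
    """Cumulative IoU: total intersection over total union across all samples."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
    return float(inter / union) if union > 0 else 0.0


def giou(pred_masks: list[np.ndarray], gt_masks: list[np.ndarray]) -> float:
    """Generalized IoU: mean per-sample IoU; an empty-vs-empty pair scores 1.0."""
    scores = []
    for p, g in zip(pred_masks, gt_masks):
        union = np.logical_or(p, g).sum()
        if union == 0:  # no-target sample correctly predicted as empty
            scores.append(1.0)
        else:
            scores.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(scores))
```

Note the difference in emphasis: cIoU weights large objects more heavily because pixel counts are pooled across the dataset, whereas gIoU treats every sample equally regardless of object size.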