GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Abstract

Pixel grounding, which encompasses tasks such as Referring Expression Segmentation (RES), has attracted considerable attention for its potential to bridge the gap between the vision and language modalities. Progress in this area is currently constrained by the limitations of existing datasets: limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework that leverages multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset containing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite training dataset yields substantial performance improvements, enabling models trained on it to achieve state-of-the-art results: a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is more efficient than the current leading data annotation method, running 4.5 times faster than GLaMM.
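The abstract reports cIoU and gIoU without defining them. As a rough illustration only, the sketch below follows the definitions commonly used in referring segmentation work: cIoU as total intersection over total union accumulated across all samples, and gIoU as the mean of per-sample IoU. These definitions, the helper names, and the handling of empty masks are assumptions for illustration, not the authors' evaluation code. Reported scores such as 68.9 are presumably these ratios multiplied by 100.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def ciou(pairs: Sequence[tuple[Mask, Mask]]) -> float:
    """Cumulative IoU: summed intersections divided by summed unions."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: return 0.0 when no foreground pixel appears anywhere.
    return total_inter / total_union if total_union else 0.0


def giou(pairs: Sequence[tuple[Mask, Mask]]) -> float:
    """Mean of per-sample IoU, so every sample has equal weight."""
    if not pairs:
        return 0.0
    scores = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction for an empty target counts as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Two toy samples: a half-correct small mask and a perfect larger mask.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),  # intersection 4, union 4
]
print(ciou(pairs))  # 5 / 6
print(giou(pairs))  # (0.5 + 1.0) / 2 = 0.75
```

Because cIoU pools pixels, large objects dominate it, whereas gIoU weights each sample equally. That is why the two toy samples give different scores under the two metrics.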
