COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation
February 4, 2025
作者: Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
cs.AI
Abstract
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic
segmentation and grounded image captioning. Building upon the COCO dataset with
advanced COCONut panoptic masks, this dataset aims to overcome limitations in
existing image-text datasets that often lack detailed, scene-comprehensive
descriptions. The COCONut-PanCap dataset incorporates fine-grained,
region-level captions grounded in panoptic segmentation masks, ensuring
consistency and improving the detail of generated captions. Through
human-edited, densely annotated descriptions, COCONut-PanCap supports improved
training of vision-language models (VLMs) for image understanding and
generative models for text-to-image tasks. Experimental results demonstrate
that COCONut-PanCap significantly boosts performance across understanding and
generation tasks, offering complementary benefits to large-scale datasets. This
dataset sets a new benchmark for evaluating models on joint panoptic
segmentation and grounded captioning tasks, addressing the need for
high-quality, detailed image-text annotations in multi-modal learning.
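To make the annotation format concrete, the sketch below shows one way a COCONut-PanCap-style record could be organized: each panoptic segment carries a region-level caption grounded in its mask, and an image-level dense caption describes the whole scene. All field and class names here are illustrative assumptions, not the dataset's released schema.

```python
# Hypothetical sketch of a panoptic-segmentation-plus-grounded-caption record.
# Field names are assumptions for illustration, not the actual COCONut-PanCap schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegionAnnotation:
    segment_id: int   # id of the panoptic mask segment this caption is grounded in
    category: str     # panoptic class label ("thing" or "stuff" category)
    caption: str      # human-edited, region-level caption for this segment


@dataclass
class PanCapRecord:
    image_id: int
    dense_caption: str                                   # scene-comprehensive description
    regions: List[RegionAnnotation] = field(default_factory=list)


# Example usage with made-up content.
record = PanCapRecord(
    image_id=42,
    dense_caption="A cyclist rides past a red food truck on a tree-lined street.",
    regions=[
        RegionAnnotation(1, "person", "a cyclist in a yellow jacket"),
        RegionAnnotation(2, "truck", "a red food truck with an open serving window"),
    ],
)
print(f"image {record.image_id}: {len(record.regions)} grounded regions")
```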