

COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation

February 4, 2025
Authors: Xueqing Deng, Qihang Yu, Ali Athar, Chenglin Yang, Linjie Yang, Xiaojie Jin, Xiaohui Shen, Liang-Chieh Chen
cs.AI

Abstract

This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut panoptic masks, this dataset aims to overcome limitations in existing image-text datasets that often lack detailed, scene-comprehensive descriptions. The COCONut-PanCap dataset incorporates fine-grained, region-level captions grounded in panoptic segmentation masks, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, COCONut-PanCap supports improved training of vision-language models (VLMs) for image understanding and generative models for text-to-image tasks. Experimental results demonstrate that COCONut-PanCap significantly boosts performance across understanding and generation tasks, offering complementary benefits to large-scale datasets. This dataset sets a new benchmark for evaluating models on joint panoptic segmentation and grounded captioning tasks, addressing the need for high-quality, detailed image-text annotations in multi-modal learning.
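To make the described annotation structure concrete, below is a minimal Python sketch of how a grounded-caption record might pair a COCONut panoptic mask with region-level captions. All class and field names (PanCapRecord, RegionCaption, and so on) are hypothetical illustrations, not the dataset's released file format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: pairing panoptic segments with region-level captions.
# Names are illustrative only and do NOT reflect the released schema.

@dataclass
class RegionCaption:
    segment_id: int   # id of one segment in the panoptic mask
    category: str     # panoptic class label, e.g. "person", "sky"
    caption: str      # human-edited description grounded in this region

@dataclass
class PanCapRecord:
    image_id: int     # COCO image id the annotation extends
    panoptic_mask: str  # path to the COCONut panoptic mask
    dense_caption: str  # full-image, human-edited description
    regions: List[RegionCaption] = field(default_factory=list)

# Example record: the dense caption is consistent with the per-segment
# captions, each of which is grounded in a specific mask region.
record = PanCapRecord(
    image_id=123456,
    panoptic_mask="masks/000000123456.png",
    dense_caption="A cyclist in a red jacket rides past a parked blue van.",
    regions=[
        RegionCaption(1, "person", "a cyclist wearing a red jacket"),
        RegionCaption(2, "car", "a blue van parked at the curb"),
    ],
)
```

Under this reading, "grounding" means every region-level caption is tied to a concrete segment id in the panoptic mask, which is what lets the same annotation serve both understanding (VLM training) and text-to-image generation.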

