Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
July 9, 2024
Authors: Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pouransari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
cs.AI
Abstract
Humans describe complex scenes with compositionality, using simple text
descriptions enriched with links and relationships. While vision-language
research has aimed to develop models with compositional understanding
capabilities, this is not yet reflected in existing datasets, which for the
most part still use plain text to describe images. In this work, we propose a
new annotation strategy, graph-based captioning (GBC), which describes an image
using a labelled graph structure with nodes of various types. GBC's nodes are
created in two stages: in the first, object detection and dense captioning
tools are applied recursively to uncover and describe entity nodes; in the
second, these entities are linked together through new types of nodes that
highlight the compositions and relations among them. Since all GBC nodes hold plain text
descriptions, GBC retains the flexibility found in natural language, but can
also encode hierarchical information in its edges. We demonstrate that GBC can
be produced automatically, using off-the-shelf multimodal LLMs and
open-vocabulary detection models, by building a new dataset, GBC10M, which
gathers GBC annotations for about 10M images of the CC12M dataset. We use GBC10M to
showcase the wealth of node captions uncovered by GBC, as measured with CLIP
training. We show that using GBC nodes' annotations -- notably those stored in
composition and relation nodes -- results in a significant performance boost on
downstream models compared to other dataset formats. To further explore
the opportunities provided by GBC, we also propose a new attention mechanism
that can leverage the entire GBC graph, with encouraging experimental results
that show the extra benefits of incorporating the graph structure. Our datasets
are released at https://huggingface.co/graph-based-captions.
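
To make the annotation format concrete, here is a minimal Python sketch of how a GBC graph could be represented: every node holds a plain-text caption, node types distinguish entities, compositions, and relations, and directed edges encode the hierarchy. The field names and the image-level root node are illustrative assumptions; the released dataset's actual schema may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeType(Enum):
    """Node types named in the abstract; the IMAGE root is an assumption."""
    IMAGE = "image"              # assumed root node carrying a global caption
    ENTITY = "entity"            # a detected region, described by dense captioning
    COMPOSITION = "composition"  # groups several related entities
    RELATION = "relation"        # relates entities to one another


@dataclass
class GBCNode:
    node_id: str
    node_type: NodeType
    caption: str                                       # plain-text description
    children: List[str] = field(default_factory=list)  # edges encode hierarchy


@dataclass
class GBCGraph:
    nodes: Dict[str, GBCNode]

    def all_captions(self) -> List[str]:
        """Flatten the graph into a bag of captions, e.g. for CLIP training."""
        return [node.caption for node in self.nodes.values()]


# Hypothetical annotation for a single image:
graph = GBCGraph(nodes={
    "root": GBCNode("root", NodeType.IMAGE,
                    "A dog chasing a red ball in a park",
                    children=["dog", "ball", "rel"]),
    "dog":  GBCNode("dog", NodeType.ENTITY, "A brown dog in mid-run"),
    "ball": GBCNode("ball", NodeType.ENTITY, "A small red rubber ball"),
    "rel":  GBCNode("rel", NodeType.RELATION,
                    "The dog chases the ball across the grass",
                    children=["dog", "ball"]),
})
print(graph.all_captions())
```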
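The abstract does not detail the proposed attention mechanism that "can leverage the entire GBC graph", but one common way to realize such an idea is to mask attention with the graph's adjacency so that each node attends only to itself and its neighbours. The NumPy sketch below illustrates that generic approach under this assumed design; it is not the paper's actual implementation.

```python
import numpy as np


def graph_masked_attention(x: np.ndarray, adj: np.ndarray) -> np.ndarray:
    """Self-attention over node features in which each node attends only
    to itself and its neighbours in the graph.

    x:   (n, d) array of per-node features (e.g. caption embeddings).
    adj: (n, n) boolean adjacency matrix of the graph.
    """
    n, d = x.shape
    scores = (x @ x.T) / np.sqrt(d)                # (n, n) attention logits
    mask = adj | np.eye(n, dtype=bool)             # always allow self-attention
    scores = np.where(mask, scores, -np.inf)       # block non-neighbour pairs
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                             # (n, d) attended features
```

Restricting attention to graph neighbours is one way to inject the hierarchical edge information into training, rather than flattening all node captions into an unstructured bag of text.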