Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
July 9, 2024
Authors: Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pouransari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
cs.AI
Abstract
Humans describe complex scenes with compositionality, using simple text
descriptions enriched with links and relationships. While vision-language
research has aimed to develop models with compositional understanding
capabilities, this is not reflected yet in existing datasets which, for the
most part, still use plain text to describe images. In this work, we propose a
new annotation strategy, graph-based captioning (GBC), that describes an image
using a labelled graph structure with nodes of various types. The nodes in GBC
are created in a first stage using object detection and dense captioning tools,
nested recursively, to uncover and describe entity nodes; in a second stage,
these entities are linked together by new types of nodes that highlight
compositions and relations among them. Since all GBC nodes hold plain text
descriptions, GBC retains the flexibility found in natural language, but can
also encode hierarchical information in its edges. We demonstrate that GBC can
be produced automatically, using off-the-shelf multimodal LLMs and
open-vocabulary detection models, by building a new dataset, GBC10M, gathering
GBC annotations for about 10M images of the CC12M dataset. We use GBC10M to
showcase the wealth of node captions uncovered by GBC, as measured with CLIP
training. We show that using GBC nodes' annotations -- notably those stored in
composition and relation nodes -- results in a significant performance boost for
downstream models compared to other dataset formats. To further explore
the opportunities provided by GBC, we also propose a new attention mechanism
that can leverage the entire GBC graph, with encouraging experimental results
that show the extra benefits of incorporating the graph structure. Our datasets
are released at https://huggingface.co/graph-based-captions.
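To make the annotation format concrete, the sketch below models a GBC-style labelled graph in which every node carries a plain-text caption and typed edges encode hierarchy, mirroring the node types named in the abstract (image, entity, composition, relation). This is an illustrative assumption rather than the released schema: the class names (GBCNode, GBCGraph, NodeType), the helper flatten_captions, and the toy captions are invented here for exposition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class NodeType(Enum):
    # Node types described in the abstract: a root image node, entity nodes
    # uncovered by detection/dense captioning, and composition/relation nodes
    # that link entities together.
    IMAGE = "image"
    ENTITY = "entity"
    COMPOSITION = "composition"
    RELATION = "relation"


@dataclass
class GBCNode:
    node_id: str
    node_type: NodeType
    caption: str                                         # every node holds a plain-text description
    children: List[str] = field(default_factory=list)    # outgoing edges encode hierarchy


@dataclass
class GBCGraph:
    nodes: Dict[str, GBCNode] = field(default_factory=dict)

    def flatten_captions(self) -> List[str]:
        # Collect all node captions, e.g. to build text targets for CLIP-style training.
        return [node.caption for node in self.nodes.values()]


# Toy example mirroring the two-stage construction described in the abstract:
# entity nodes are created first, then a relation node connects them.
graph = GBCGraph()
graph.nodes["img"] = GBCNode("img", NodeType.IMAGE,
                             "A dog chasing a ball in a park",
                             children=["dog", "ball", "rel"])
graph.nodes["dog"] = GBCNode("dog", NodeType.ENTITY, "a brown dog running")
graph.nodes["ball"] = GBCNode("ball", NodeType.ENTITY, "a red rubber ball")
graph.nodes["rel"] = GBCNode("rel", NodeType.RELATION,
                             "the dog is chasing the ball",
                             children=["dog", "ball"])

print(graph.flatten_captions())
```

Flattening the node captions, as in flatten_captions above, is one simple way to expose the extra region-level text to a standard image-text training pipeline; leveraging the edges themselves would require something like the graph-aware attention mechanism the paper proposes.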