CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
November 6, 2023
Authors: Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
cs.AI
Abstract
A remarkable ability of human beings resides in compositional reasoning,
i.e., the capacity to make "infinite use of finite means". However, current
large vision-language foundation models (VLMs) fall short of such compositional
abilities due to their "bag-of-words" behaviors and inability to construct
words that correctly represent visual entities and the relations among the
entities. To this end, we propose CoVLM, which can guide the LLM to explicitly
compose visual entities and relationships among the text and dynamically
communicate with the vision encoder and detection network to achieve
vision-language communicative decoding. Specifically, we first devise a set of
novel communication tokens for the LLM, for dynamic communication between the
visual detection system and the language system. A communication token is
generated by the LLM following a visual entity or a relation, to inform the
detection network to propose regions that are relevant to the sentence
generated so far. The proposed regions-of-interests (ROIs) are then fed back
into the LLM for better language generation contingent on the relevant regions.
The LLM is thus able to compose the visual entities and relationships through
the communication tokens. The vision-to-language and language-to-vision
communication are iteratively performed until the entire sentence is generated.
Our framework seamlessly bridges the gap between visual perception and LLMs and
outperforms previous VLMs by a large margin on compositional reasoning
benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in
ARO top-1 accuracy). We also achieve state-of-the-art performance on
traditional vision-language tasks such as referring expression comprehension
and visual question answering.
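
To make the communicative decoding loop above concrete, here is a minimal Python sketch of the control flow: the language model emits tokens, and whenever it produces a communication token, the detector is queried for regions relevant to the sentence so far, whose information is then fed back into the language model's context. All names here (ToyLM, ToyDetector, the <visual> and <box> token strings, propose_rois) are hypothetical stand-ins for illustration, not the paper's actual components or API.

```python
# Minimal sketch of vision-language communicative decoding, assuming toy
# stand-in components. ToyLM and ToyDetector are hypothetical placeholders
# that only illustrate the iterative language->vision / vision->language loop.

from dataclasses import dataclass, field

COMM_TOKEN = "<visual>"   # emitted by the LM after a visual entity or relation
FEEDBACK_TOKEN = "<box>"  # wraps ROI information fed back into the LM context
EOS = "<eos>"


@dataclass
class ToyDetector:
    """Stands in for the detection network: maps the sentence so far to ROIs."""

    def propose_rois(self, image, sentence_so_far):
        # A real detector would score region proposals against the text.
        # Here we simply return one dummy box.
        return [(0.1, 0.2, 0.5, 0.6)]


@dataclass
class ToyLM:
    """Stands in for the LLM: yields tokens, sometimes a communication token."""

    script: list = field(default_factory=lambda: [
        "a", "person", COMM_TOKEN, "riding", COMM_TOKEN, "a", "horse", EOS
    ])
    step: int = 0

    def next_token(self, context):
        tok = self.script[self.step]
        self.step += 1
        return tok


def communicative_decoding(lm, detector, image):
    """Alternate language-to-vision and vision-to-language until the sentence ends."""
    context, sentence = [], []
    while True:
        tok = lm.next_token(context)
        if tok == EOS:
            break
        if tok == COMM_TOKEN:
            # Language-to-vision: ask the detector for regions relevant to the
            # sentence generated so far.
            rois = detector.propose_rois(image, sentence)
            # Vision-to-language: feed ROI information back into the LM context
            # so subsequent generation is conditioned on the relevant regions.
            context.append(f"{FEEDBACK_TOKEN}{rois}")
        else:
            sentence.append(tok)
            context.append(tok)
    return " ".join(sentence), context


if __name__ == "__main__":
    sentence, context = communicative_decoding(ToyLM(), ToyDetector(), image=None)
    print(sentence)  # -> "a person riding a horse"
```

The key design point the sketch mirrors is that decoding does not finish before the detector is consulted: region proposals are interleaved with token generation, so each subsequent token can be conditioned on the regions grounding the entities and relations mentioned so far.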