CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
November 6, 2023
Authors: Junyan Li, Delin Chen, Yining Hong, Zhenfang Chen, Peihao Chen, Yikang Shen, Chuang Gan
cs.AI
Abstract
A remarkable ability of human beings resides in compositional reasoning,
i.e., the capacity to make "infinite use of finite means". However, current
large vision-language foundation models (VLMs) fall short of such compositional
abilities due to their "bag-of-words" behaviors and inability to construct
words that correctly represent visual entities and the relations among the
entities. To this end, we propose CoVLM, which can guide the LLM to explicitly
compose visual entities and relationships among the text and dynamically
communicate with the vision encoder and detection network to achieve
vision-language communicative decoding. Specifically, we first devise a set of
novel communication tokens for the LLM, for dynamic communication between the
visual detection system and the language system. A communication token is
generated by the LLM following a visual entity or a relation, to inform the
detection network to propose regions that are relevant to the sentence
generated so far. The proposed regions of interest (ROIs) are then fed back
into the LLM for better language generation contingent on the relevant regions.
The LLM is thus able to compose the visual entities and relationships through
the communication tokens. The vision-to-language and language-to-vision
communications are performed iteratively until the entire sentence is generated.
Our framework seamlessly bridges the gap between visual perception and LLMs and
outperforms previous VLMs by a large margin on compositional reasoning
benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in
ARO top-1 accuracy). We also achieve state-of-the-art performance on
traditional vision-language tasks such as referring expression comprehension
and visual question answering.
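
To make the alternating decoding loop described above concrete, here is a minimal, purely illustrative Python sketch of how an LLM might emit communication tokens, query a detection network for relevant regions, and condition further generation on the returned ROIs. All names here (ToyLLM, ToyDetector, VISUAL_TOKEN, BOX_TOKEN, attend_to_rois, etc.) are hypothetical placeholders for illustration, not the authors' actual CoVLM implementation, token set, or API.

```python
# Illustrative sketch of vision-language communicative decoding.
# Every class and token name below is a placeholder assumption,
# not the authors' real implementation.

from dataclasses import dataclass
from typing import List

VISUAL_TOKEN = "<visual>"  # emitted after an entity/relation to request regions
BOX_TOKEN = "<box>"        # marks where ROI information re-enters the text stream
EOS_TOKEN = "<eos>"


@dataclass
class ROI:
    """A region proposed by the detection network (placeholder)."""
    box: tuple      # (x1, y1, x2, y2)
    feature: list   # pooled visual feature for the region


class ToyDetector:
    """Stand-in for a detection network conditioned on the text generated so far."""
    def propose(self, image, text_so_far: str) -> List[ROI]:
        # A real detector would score image regions against the language prefix;
        # here we just return one dummy ROI.
        return [ROI(box=(0, 0, 10, 10), feature=[0.0] * 4)]


class ToyLLM:
    """Stand-in for a language model that can also emit communication tokens."""
    def __init__(self):
        # Scripted output purely for demonstration.
        self._script = ["A", "dog", VISUAL_TOKEN, "chases", VISUAL_TOKEN,
                        "a", "ball", EOS_TOKEN]
        self._i = 0

    def next_token(self, prefix: List[str]) -> str:
        tok = self._script[self._i]
        self._i += 1
        return tok

    def attend_to_rois(self, rois: List[ROI]) -> None:
        # A real model would inject the ROI features into its context so that
        # subsequent tokens are grounded in those regions; omitted here.
        pass


def communicative_decode(llm: ToyLLM, detector: ToyDetector, image) -> str:
    """Alternate language-to-vision and vision-to-language steps until <eos>."""
    tokens: List[str] = []
    while True:
        tok = llm.next_token(tokens)
        if tok == EOS_TOKEN:
            break
        tokens.append(tok)
        if tok == VISUAL_TOKEN:
            # Language-to-vision: ask for regions relevant to the sentence so far.
            rois = detector.propose(image, " ".join(tokens))
            # Vision-to-language: feed the ROIs back for grounded generation.
            llm.attend_to_rois(rois)
            tokens.append(BOX_TOKEN)
    return " ".join(tokens)


if __name__ == "__main__":
    print(communicative_decode(ToyLLM(), ToyDetector(), image=None))
```

The key point the sketch tries to capture is the iteration: each communication token triggers a detection step whose output is fed back into the language model before decoding resumes, so entities and relations are composed one grounded step at a time rather than generated as an ungrounded "bag of words".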