CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Abstract

A remarkable human ability is compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities because of their "bag-of-words" behavior and their inability to construct words that correctly represent visual entities and the relations among them. To this end, we propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships in the text and to communicate dynamically with the vision encoder and detection network, achieving vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM that enable dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation to inform the detection network to propose regions relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose visual entities and relationships through the communication tokens. Vision-to-language and language-to-vision communication are performed iteratively until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
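
The decoding loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the token name `<comm>`, the box format, and the functions `communicative_decode`, `toy_next_token`, and `toy_propose_rois` are all hypothetical stand-ins, and the "language model" and "detector" are hard-coded stubs. The sketch only shows the control flow: whenever the language side emits a communication token, the detection side proposes ROIs for the text so far, and those ROIs are fed back as context for the next token.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # hypothetical ROI format: (x1, y1, x2, y2)

COMM = "<comm>"  # hypothetical communication token (the abstract does not name the tokens)
EOS = "<eos>"    # end-of-sentence marker


@dataclass
class Step:
    token: str
    rois: List[Box]  # ROIs proposed right after this token (empty unless token is COMM)


def communicative_decode(
    next_token: Callable[[List[str], List[Box]], str],
    propose_rois: Callable[[List[str]], List[Box]],
    max_steps: int = 32,
) -> Tuple[List[str], List[Step]]:
    """Alternate language-to-vision and vision-to-language communication until EOS."""
    tokens: List[str] = []        # sentence generated so far
    context_rois: List[Box] = []  # ROIs fed back to the language side
    trace: List[Step] = []
    for _ in range(max_steps):
        tok = next_token(tokens, context_rois)
        if tok == EOS:
            break
        tokens.append(tok)
        rois: List[Box] = []
        if tok == COMM:
            rois = propose_rois(tokens)  # language -> vision: ask for relevant regions
            context_rois.extend(rois)    # vision -> language: feed the regions back
        trace.append(Step(tok, rois))
    return tokens, trace


# --- Toy stand-ins for the LLM and the detection network (assumptions) ---
MAN_BOX: Box = (10.0, 20.0, 110.0, 220.0)
HORSE_BOX: Box = (120.0, 80.0, 300.0, 260.0)
REGION_NAMES = {HORSE_BOX: "horse"}  # stands in for region features seen by the LLM


def toy_next_token(tokens: List[str], rois: List[Box]) -> str:
    prefix = ["a", "man", COMM, "riding", COMM, "a"]
    n = len(tokens)
    if n < len(prefix):
        return prefix[n]
    if n == len(prefix):
        # The object word depends on the most recent ROI that was fed back.
        return REGION_NAMES.get(rois[-1], "thing")
    if n == len(prefix) + 1:
        return COMM
    return EOS


def toy_propose_rois(tokens: List[str]) -> List[Box]:
    last_word = [t for t in tokens if t != COMM][-1]
    table = {"man": [MAN_BOX], "riding": [HORSE_BOX], "horse": [HORSE_BOX]}
    return table.get(last_word, [])


if __name__ == "__main__":
    sentence, steps = communicative_decode(toy_next_token, toy_propose_rois)
    print(" ".join(sentence))
    for step in steps:
        if step.rois:
            print(step.token, "->", step.rois)
```

In this toy run the word after "riding" is chosen from the region proposed for the relation, which mirrors the abstract's claim that generation is contingent on the fed-back regions. A real system would replace the lookup tables with a vision encoder, a detection network, and an LLM.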
