CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Abstract

A remarkable ability of human beings resides in compositional reasoning, i.e., the capacity to make "infinite use of finite means". However, current large vision-language foundation models (VLMs) fall short of such compositional abilities due to their "bag-of-words" behavior and their inability to construct words that correctly represent visual entities and the relations among them. To this end, we propose CoVLM, which guides the large language model (LLM) to explicitly compose visual entities and relationships within the text and to dynamically communicate with the vision encoder and detection network, achieving vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM, which enable dynamic communication between the visual detection system and the language system. A communication token is generated by the LLM following a visual entity or a relation, to inform the detection network to propose regions that are relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM for better language generation contingent on the relevant regions. The LLM is thus able to compose the visual entities and relationships through the communication tokens. Vision-to-language and language-to-vision communication is performed iteratively until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
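The iterative loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the token name `<comm>`, the box format, and the functions `communicative_decode`, `toy_next_token`, and `toy_propose_rois` are all hypothetical stand-ins, and the "LLM" and "detector" are scripted stubs rather than real models. The sketch only shows the control flow: a communication token triggers a region proposal conditioned on the text so far, and the proposed ROIs are added to the context used for later tokens.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]

COMM_TOKEN = "<comm>"  # hypothetical name for a communication token
EOS_TOKEN = "<eos>"    # hypothetical end-of-sentence marker


@dataclass
class DecodeState:
    tokens: List[str] = field(default_factory=list)
    rois: List[Box] = field(default_factory=list)


def communicative_decode(
    next_token: Callable[[List[str], List[Box]], str],
    propose_rois: Callable[[List[str]], List[Box]],
    max_steps: int = 32,
) -> DecodeState:
    """Alternate between language generation and region proposal."""
    state = DecodeState()
    for _ in range(max_steps):
        # The language side sees both the text so far and the ROIs so far.
        token = next_token(state.tokens, state.rois)
        if token == EOS_TOKEN:
            break
        state.tokens.append(token)
        if token == COMM_TOKEN:
            # Language-to-vision: ask the detector for regions relevant
            # to the sentence generated so far.
            new_rois = propose_rois(state.tokens)
            # Vision-to-language: feed the ROIs back as context.
            state.rois.extend(new_rois)
    return state


# --- Scripted stand-ins for the LLM and the detection network (toy only) ---

_SCRIPT = ["a", "person", COMM_TOKEN, "riding", COMM_TOKEN,
           "a", "horse", COMM_TOKEN, EOS_TOKEN]

_BOXES = {
    "person": (10.0, 20.0, 110.0, 220.0),
    "riding": (10.0, 20.0, 200.0, 240.0),
    "horse": (60.0, 90.0, 200.0, 240.0),
}


def toy_next_token(tokens: List[str], rois: List[Box]) -> str:
    # A real model would condition on `rois`; this stub replays a script.
    return _SCRIPT[len(tokens)]


def toy_propose_rois(tokens: List[str]) -> List[Box]:
    # Look up a fixed box for the word just before the communication token.
    return [_BOXES[tokens[-2]]]


if __name__ == "__main__":
    result = communicative_decode(toy_next_token, toy_propose_rois)
    print(" ".join(result.tokens))
    print(result.rois)
```

In this toy setup each entity or relation word is followed by one communication token, so the number of collected ROIs equals the number of communication tokens in the output. A real system would replace the two stubs with model calls.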
