CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Abstract

A remarkable human ability is compositional reasoning, i.e., the capacity to make "infinite use of finite means". Current large vision-language foundation models (VLMs), however, fall short of such compositional abilities because of their "bag-of-words" behavior and their inability to construct words that correctly represent visual entities and the relations among them. To this end, we propose CoVLM, which guides the LLM to explicitly compose visual entities and relationships in the text and to communicate dynamically with the vision encoder and detection network, achieving vision-language communicative decoding. Specifically, we first devise a set of novel communication tokens for the LLM that enable dynamic communication between the visual detection system and the language system. The LLM generates a communication token after a visual entity or a relation to tell the detection network to propose regions relevant to the sentence generated so far. The proposed regions of interest (ROIs) are then fed back into the LLM so that subsequent language generation is conditioned on the relevant regions. The LLM is thus able to compose visual entities and relationships through the communication tokens. This vision-to-language and language-to-vision communication is performed iteratively until the entire sentence is generated. Our framework seamlessly bridges the gap between visual perception and LLMs and outperforms previous VLMs by a large margin on compositional reasoning benchmarks (e.g., ~20% in HICO-DET mAP, ~14% in Cola top-1 accuracy, and ~3% in ARO top-1 accuracy). We also achieve state-of-the-art performance on traditional vision-language tasks such as referring expression comprehension and visual question answering.
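
The alternating loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the token names `<comm>` and `<eos>`, the function names, the box format, and the scripted stand-ins for the language model and the detection network are all hypothetical and exist only to show the control flow (generate a token, and on a communication token ask the detector for regions that the next generation step can see).

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2); hypothetical box format

COMM = "<comm>"  # hypothetical name for a communication token
EOS = "<eos>"    # hypothetical end-of-sentence token


def communicative_decode(
    next_token: Callable[[List[str], List[Box]], str],
    propose_regions: Callable[[List[str]], List[Box]],
    max_steps: int = 32,
) -> Tuple[List[str], List[List[Box]]]:
    """Alternate between text generation and region proposal."""
    tokens: List[str] = []
    rois: List[Box] = []           # regions currently visible to the language model
    roi_log: List[List[Box]] = []  # one entry per communication token emitted
    for _ in range(max_steps):
        # Vision -> language: the model conditions on the latest ROIs.
        token = next_token(tokens, rois)
        if token == EOS:
            break
        tokens.append(token)
        if token == COMM:
            # Language -> vision: request regions relevant to the text so far.
            rois = propose_regions(tokens)
            roi_log.append(rois)
    return tokens, roi_log


# Toy stand-ins (assumptions for illustration, not real models or data).
SCRIPT = ["person", COMM, "riding", COMM, "horse", COMM, EOS]
TOY_BOXES: Dict[str, List[Box]] = {
    "person": [(10, 5, 40, 60)],
    "riding": [(5, 30, 90, 95)],
    "horse": [(5, 30, 90, 95)],
}


def scripted_lm(tokens: List[str], rois: List[Box]) -> str:
    # Emits a fixed sentence; a real model would also use `rois`.
    return SCRIPT[len(tokens)]


def toy_detector(tokens: List[str]) -> List[Box]:
    # Looks up boxes for the word just before the communication token.
    return TOY_BOXES.get(tokens[-2], [])


tokens, roi_log = communicative_decode(scripted_lm, toy_detector)
print(" ".join(tokens))
print(roi_log)
```

In this toy run the detector is called once per communication token, so `roi_log` has one entry for each entity or relation in the generated sentence. How a real system encodes the regions and feeds them back into the LLM is outside what this sketch covers.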
