

BRAVE: Broadening the visual encoding of vision-language models

April 10, 2024
Authors: Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
cs.AI

Abstract

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders, e.g. "blindness" to certain image features, visual hallucination, etc. To address these issues, we study broadening the visual encoding capabilities of VLMs. We first comprehensively benchmark several vision encoders with different inductive biases for solving VLM tasks. We observe that there is no single encoding configuration that consistently achieves top performance across different tasks, and encoders with different biases can perform surprisingly similarly. Motivated by this, we introduce a method, named BRAVE, that consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM. BRAVE achieves state-of-the-art performance on a broad range of captioning and VQA benchmarks and significantly reduces the aforementioned issues of VLMs, while requiring a smaller number of trainable parameters than existing methods and having a more compressed representation. Our results highlight the potential of incorporating different visual biases for a broader and more contextualized visual understanding in VLMs.
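The abstract describes the overall recipe (several frozen vision encoders whose features are consolidated into a compact visual prefix for a frozen LM) but not the exact consolidation module. The sketch below, in PyTorch, illustrates one plausible realization under stated assumptions: the class name `BraveSketch`, the per-encoder linear projections, the use of learned queries with a single cross-attention layer, and the HuggingFace-style `inputs_embeds` call are all illustrative choices, not the paper's actual architecture or API.

```python
# Minimal sketch, assuming a prefix-style VLM: features from multiple frozen
# vision encoders are consolidated by a small trainable module into a fixed
# number of visual tokens, which are prepended to the text embeddings of a
# frozen language model. All module names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class BraveSketch(nn.Module):
    def __init__(self, vision_encoders, enc_dims, lm, lm_dim, num_tokens=32):
        super().__init__()
        # Frozen vision encoders with different inductive biases (e.g. CLIP, DINO).
        self.encoders = nn.ModuleList(vision_encoders)
        for enc in self.encoders:
            for p in enc.parameters():
                p.requires_grad = False
        # Only the consolidation module below is trainable.
        self.proj = nn.ModuleList(nn.Linear(d, lm_dim) for d in enc_dims)
        self.queries = nn.Parameter(torch.randn(num_tokens, lm_dim) * 0.02)
        self.attn = nn.MultiheadAttention(lm_dim, num_heads=8, batch_first=True)
        # Frozen language model that consumes the visual prefix.
        self.lm = lm
        for p in self.lm.parameters():
            p.requires_grad = False

    def encode_image(self, image):
        feats = []
        for enc, proj in zip(self.encoders, self.proj):
            with torch.no_grad():
                f = enc(image)          # (B, N_i, D_i) patch features per encoder
            feats.append(proj(f))       # project into the LM embedding space
        feats = torch.cat(feats, dim=1)  # concatenate tokens from all encoders
        # Learned queries cross-attend to the combined features, yielding a
        # fixed-size, more compact visual representation for the frozen LM.
        q = self.queries.expand(feats.size(0), -1, -1)
        visual_tokens, _ = self.attn(q, feats, feats)
        return visual_tokens

    def forward(self, image, text_embeds):
        visual_tokens = self.encode_image(image)
        # Prepend the visual tokens to the text embeddings (prefix conditioning);
        # `inputs_embeds` follows the HuggingFace convention and is an assumption.
        return self.lm(inputs_embeds=torch.cat([visual_tokens, text_embeds], dim=1))
```

Because the encoders and the LM stay frozen, only the projections, queries, and cross-attention weights would be trained, which is consistent with the abstract's claim of a small trainable-parameter budget and a compressed visual representation.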

