BRAVE: Broadening the visual encoding of vision-language models
April 10, 2024
Authors: Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari
cs.AI
Abstract
Vision-language models (VLMs) are typically composed of a vision encoder,
e.g. CLIP, and a language model (LM) that interprets the encoded features to
solve downstream tasks. Despite remarkable progress, VLMs are subject to
several shortcomings due to the limited capabilities of vision encoders, e.g.
"blindness" to certain image features, visual hallucination, etc. To address
these issues, we study broadening the visual encoding capabilities of VLMs. We
first comprehensively benchmark several vision encoders with different
inductive biases for solving VLM tasks. We observe that there is no single
encoding configuration that consistently achieves top performance across
different tasks, and encoders with different biases can perform surprisingly
similarly. Motivated by this, we introduce a method, named BRAVE, that
consolidates features from multiple frozen encoders into a more versatile
representation that can be directly fed as the input to a frozen LM. BRAVE
achieves state-of-the-art performance on a broad range of captioning and VQA
benchmarks and significantly reduces the aforementioned issues of VLMs, while
requiring a smaller number of trainable parameters than existing methods and
having a more compressed representation. Our results highlight the potential of
incorporating different visual biases for a broader and more contextualized
visual understanding of VLMs.
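The abstract describes an architecture in which token features from several frozen vision encoders are consolidated into a single compact sequence that a frozen LM consumes directly. Below is a minimal PyTorch sketch of that idea, assuming a learnable-query resampler over the concatenated encoder tokens; the class name `MultiEncoderBridge`, the encoder and LM dimensions, and the resampler depth are illustrative assumptions, not the paper's exact trained module or hyperparameters.

```python
import torch
import torch.nn as nn

class MultiEncoderBridge(nn.Module):
    """Sketch: fuse tokens from several frozen vision encoders into a
    fixed-length sequence in a frozen LM's embedding space. All sizes
    here are illustrative assumptions, not the paper's configuration."""

    def __init__(self, encoder_dims, d_model=768, n_queries=32, lm_dim=2048):
        super().__init__()
        # One linear projection per encoder, into a shared width d_model.
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in encoder_dims])
        # Learnable queries compress the concatenated visual tokens into
        # a short, fixed-length representation via cross-attention.
        self.queries = nn.Parameter(0.02 * torch.randn(n_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.resampler = nn.TransformerDecoder(layer, num_layers=4)
        # Final projection into the frozen LM's token-embedding space.
        self.to_lm = nn.Linear(d_model, lm_dim)

    def forward(self, encoder_feats):
        # encoder_feats: list of [B, N_i, D_i] tensors, one per frozen
        # encoder (computed under torch.no_grad(); encoders stay frozen).
        tokens = torch.cat(
            [proj(f) for proj, f in zip(self.proj, encoder_feats)], dim=1)
        queries = self.queries.expand(tokens.size(0), -1, -1)
        fused = self.resampler(queries, tokens)   # [B, n_queries, d_model]
        return self.to_lm(fused)                  # prepended to LM inputs


# Toy usage with two hypothetical frozen encoders (dims 1024 and 1152):
bridge = MultiEncoderBridge(encoder_dims=[1024, 1152])
feats = [torch.randn(2, 196, 1024), torch.randn(2, 256, 1152)]
out = bridge(feats)
print(out.shape)  # torch.Size([2, 32, 2048])
```

In this sketch only the bridge is trained while the encoders and the LM stay frozen, which lines up with the abstract's claims of fewer trainable parameters and a more compressed (fixed-length) visual representation.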