Improving Visual Commonsense in Language Models via Multiple Image Generation
June 19, 2024
作者: Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim
cs.AI
Abstract
Commonsense reasoning is fundamentally based on multimodal knowledge.
However, existing large language models (LLMs) are primarily trained using
textual data only, limiting their ability to incorporate essential visual
information. In contrast, Visual Language Models, which excel at
visually-oriented tasks, often fail at non-visual tasks such as basic
commonsense reasoning. This divergence highlights a critical challenge - the
integration of robust visual understanding with foundational text-based
language reasoning. To this end, we introduce a method aimed at enhancing LLMs'
visual commonsense. Specifically, our method generates multiple images based on
the input text prompt and integrates these into the model's decision-making
process by mixing their prediction probabilities. To facilitate multimodal
grounded language modeling, we employ a late-fusion layer that combines the
projected visual features with the output of a pre-trained LLM conditioned on
text only. This late-fusion layer enables predictions based on comprehensive
image-text knowledge, as well as on text alone when that is required. We evaluate
our approach using several visual commonsense reasoning tasks together with
traditional NLP tasks, including commonsense reasoning and reading
comprehension. Our experimental results demonstrate significant superiority
over existing baselines. When applied to recent state-of-the-art LLMs (e.g.,
Llama3), we observe improvements not only in visual commonsense but also in
traditional NLP benchmarks. Code and models are available under
https://github.com/guyyariv/vLMIG.
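The two mechanisms described in the abstract, mixing prediction probabilities over several images generated from the same prompt and a late-fusion layer on top of a frozen text-only LLM, can be sketched roughly as follows. This is a minimal illustration and not the released vLMIG code; the module names, tensor shapes, gating scheme, and simple probability averaging are assumptions for the sake of the example.

```python
# Minimal sketch (not the authors' implementation) of late fusion plus
# probability mixing over multiple generated images, under assumed shapes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LateFusionHead(nn.Module):
    """Combines projected visual features with text-only LLM logits."""

    def __init__(self, vis_dim: int, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)  # project image features
        self.fuse = nn.Linear(hidden_dim, vocab_size)   # map fused features to vocab logits

    def forward(self, text_logits: torch.Tensor, vis_feat: torch.Tensor,
                gate: float = 1.0) -> torch.Tensor:
        # text_logits: (batch, vocab) from the pre-trained, text-only LLM
        # vis_feat:    (batch, vis_dim) from an image encoder
        # gate=0.0 recovers pure text-only predictions when no image is needed.
        vis_logits = self.fuse(torch.tanh(self.vis_proj(vis_feat)))
        return text_logits + gate * vis_logits


def mix_image_predictions(text_logits, vis_feats, head, gate=1.0):
    """Average next-token probabilities over multiple generated images."""
    probs = [F.softmax(head(text_logits, v, gate), dim=-1) for v in vis_feats]
    return torch.stack(probs, dim=0).mean(dim=0)  # (batch, vocab)


if __name__ == "__main__":
    # Dummy usage: 3 images generated for one prompt, hypothetical dimensions.
    head = LateFusionHead(vis_dim=768, hidden_dim=512, vocab_size=32000)
    text_logits = torch.randn(1, 32000)
    vis_feats = [torch.randn(1, 768) for _ in range(3)]
    mixed = mix_image_predictions(text_logits, vis_feats, head)
    print(mixed.shape)  # torch.Size([1, 32000])
```

In this reading, the pre-trained LLM stays conditioned on text only, the added head contributes image-grounded logits, and averaging the resulting probability distributions across several generated images smooths over any single unrepresentative generation; the actual fusion and mixing details may differ in the released code.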