Improving Visual Commonsense in Language Models via Multiple Image Generation
June 19, 2024
Authors: Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim
cs.AI
Abstract
Commonsense reasoning is fundamentally based on multimodal knowledge.
However, existing large language models (LLMs) are primarily trained using
textual data only, limiting their ability to incorporate essential visual
information. In contrast, Visual Language Models, which excel at
visually-oriented tasks, often fail at non-visual tasks such as basic
commonsense reasoning. This divergence highlights a critical challenge - the
integration of robust visual understanding with foundational text-based
language reasoning. To this end, we introduce a method aimed at enhancing LLMs'
visual commonsense. Specifically, our method generates multiple images based on
the input text prompt and integrates these into the model's decision-making
process by mixing their prediction probabilities. To facilitate multimodal
grounded language modeling, we employ a late-fusion layer that combines the
projected visual features with the output of a pre-trained LLM conditioned on
text only. This late-fusion layer enables predictions based on comprehensive
image-text knowledge as well as text only when this is required. We evaluate
our approach using several visual commonsense reasoning tasks together with
traditional NLP tasks, including common sense reasoning and reading
comprehension. Our experimental results demonstrate significant superiority
over existing baselines. When applied to recent state-of-the-art LLMs (e.g.,
Llama3), we observe improvements not only in visual common sense but also in
traditional NLP benchmarks. Code and models are available under
https://github.com/guyyariv/vLMIG.
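To make the abstract's description more concrete, below is a minimal sketch (not the authors' released implementation) of the two ideas it mentions: a late-fusion layer that combines projected visual features with the hidden states of a text-only LLM, and mixing next-token probabilities across several images generated for the same prompt. The component names and shapes (`LateFusionHead`, `prompt_hidden`, `image_features`) are illustrative assumptions, not names from the paper or repository.

```python
# Hedged sketch of late fusion + probability mixing over multiple generated images.
# Assumes you already have (a) pooled features for each generated image and
# (b) hidden states of the frozen, text-only LLM for the prompt.
import torch
import torch.nn as nn


class LateFusionHead(nn.Module):
    """Combine projected visual features with text-only LLM hidden states."""

    def __init__(self, vis_dim: int, llm_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)       # project visual features into LLM space
        self.gate = nn.Linear(2 * llm_dim, 1)         # learned gate: how much image info to use
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, text_hidden: torch.Tensor, vis_feat: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq, llm_dim) from the frozen text-only LLM
        # vis_feat:    (batch, vis_dim) pooled features of one generated image
        vis = self.proj(vis_feat).unsqueeze(1).expand_as(text_hidden)
        gate = torch.sigmoid(self.gate(torch.cat([text_hidden, vis], dim=-1)))
        fused = text_hidden + gate * vis              # gate ~ 0 falls back to text-only prediction
        return self.lm_head(fused)                    # next-token logits


@torch.no_grad()
def mixed_next_token_probs(prompt_hidden, image_features, fusion_head):
    """Average ("mix") next-token probabilities over several generated images."""
    probs = []
    for vis_feat in image_features:                   # one feature vector per generated image
        logits = fusion_head(prompt_hidden, vis_feat.unsqueeze(0))
        probs.append(torch.softmax(logits[:, -1, :], dim=-1))
    return torch.stack(probs).mean(dim=0)             # uniform mixture over images
```

Under these assumptions, the gate lets the model rely on the text-only LLM when visual grounding is unhelpful, while averaging over multiple generated images reduces the impact of any single inaccurate generation.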