Improving Visual Commonsense in Language Models via Multiple Image Generation

Abstract

Commonsense reasoning is fundamentally based on multimodal knowledge. However, existing large language models (LLMs) are trained primarily on textual data alone, which limits their ability to incorporate essential visual information. In contrast, visual language models, which excel at visually oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlights a critical challenge: integrating robust visual understanding with foundational text-based language reasoning. To this end, we introduce a method for enhancing LLMs' visual commonsense. Specifically, our method generates multiple images from the input text prompt and integrates them into the model's decision-making process by mixing their prediction probabilities. To facilitate multimodal grounded language modeling, we employ a late-fusion layer that combines the projected visual features with the output of a pre-trained LLM conditioned on text only. This late-fusion layer enables predictions based on comprehensive image-text knowledge, as well as on text alone when required. We evaluate our approach on several visual commonsense reasoning tasks together with traditional NLP tasks, including commonsense reasoning and reading comprehension. Our experimental results demonstrate significant superiority over existing baselines. When applied to recent state-of-the-art LLMs (e.g., Llama3), we observe improvements not only in visual commonsense but also in traditional NLP benchmarks. Code and models are available at https://github.com/guyyariv/vLMIG.
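The two mechanisms named in the abstract, late fusion and mixing of per-image prediction probabilities, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the additive gated form of the fusion, the scalar `gate`, and the optional per-image `weights` are all assumptions made for illustration, and the vectors stand in for outputs that a real system would obtain from an LLM and a visual projection module.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(text_out: Sequence[float],
                visual_proj: Sequence[float],
                gate: float) -> List[float]:
    """Hypothetical late fusion: add gated, projected visual features to the
    text-only output. Assumes both vectors already live in the same space.
    With gate = 0.0 the result equals the text-only output."""
    if len(text_out) != len(visual_proj):
        raise ValueError("text and visual vectors must have the same length")
    return [t + gate * v for t, v in zip(text_out, visual_proj)]


def mix_predictions(per_image_probs: Sequence[Sequence[float]],
                    weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted average of one probability distribution per generated image.
    Uniform weights are an assumption; any non-negative weights work."""
    k = len(per_image_probs)
    if k == 0:
        raise ValueError("need at least one distribution")
    if weights is None:
        weights = [1.0] * k
    if len(weights) != k or sum(weights) <= 0:
        raise ValueError("weights must match the number of images and sum to > 0")
    total = sum(weights)
    size = len(per_image_probs[0])
    return [
        sum(w * p[i] for w, p in zip(weights, per_image_probs)) / total
        for i in range(size)
    ]


# Toy example: three candidate answers, two generated images (made-up numbers).
text_out = [2.0, 1.0, 0.0]                         # text-only scores
visual_projs = [[0.0, 2.0, 0.0], [0.0, 1.0, 1.0]]  # one vector per image

per_image = [softmax(late_fusion(text_out, v, gate=1.0)) for v in visual_projs]
print(mix_predictions(per_image))
```

Averaging probabilities rather than raw scores keeps the mixed output a valid distribution, and setting `gate` to zero recovers the text-only prediction, which mirrors the abstract's point that the model can fall back to text alone when images are not helpful.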
