Intriguing Properties of Large Language and Vision Models

Abstract


Large language and vision models (LLVMs) have recently received significant attention and development effort owing to their remarkable generalization performance across a wide range of tasks that require perception and cognitive abilities. A key factor behind their success is their simple architecture, which consists of a vision encoder, a projector, and a large language model (LLM). Despite their achievements in advanced reasoning tasks, their performance on fundamental perception-related tasks (e.g., MMVP) remains surprisingly low. This discrepancy raises the question of how LLVMs truly perceive images and exploit the advantages of the vision encoder. To address this, we systematically investigate the question along several aspects (permutation invariance, robustness, math reasoning, and alignment preservation and importance) by evaluating the most common LLVM family (i.e., LLaVA) across 10 evaluation benchmarks. Our extensive experiments reveal several intriguing properties of current LLVMs: (1) they internally process the image in a global manner, even when the order of the visual patch sequence is randomly permuted; (2) they are sometimes able to solve math problems without fully perceiving detailed numerical information; (3) the cross-modal alignment is overfitted to complex reasoning tasks, thereby causing them to lose some of the original perceptual capabilities of their vision encoder; (4) the representation space in the lower layers (<25%) plays a crucial role in determining performance and enhancing visual understanding. Lastly, based on these observations, we suggest potential future directions for building better LLVMs and constructing more challenging evaluation benchmarks.
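To make the patch-permutation perturbation in property (1) concrete, the following is a minimal sketch, not the authors' implementation. It assumes the image is a small 2D grid of values and that patches are non-overlapping squares taken in row-major order; the abstract does not say whether the permutation is applied to pixel patches or to encoder output tokens. The names `split_into_patches` and `permute_patches` are hypothetical and introduced only for illustration.

```python
import random


def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into non-overlapping square patches.

    Patches are returned in row-major order. This layout is an assumption
    made for illustration only.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            rows = image[top:top + patch_size]
            patches.append([row[left:left + patch_size] for row in rows])
    return patches


def permute_patches(patches, seed=0):
    """Randomly reorder the patch sequence.

    Returns (shuffled, perm) such that shuffled[i] == patches[perm[i]].
    A seeded generator keeps the perturbation reproducible.
    """
    rng = random.Random(seed)
    perm = list(range(len(patches)))
    rng.shuffle(perm)
    return [patches[i] for i in perm], perm


# Toy 4x4 "image" whose entries encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
shuffled, perm = permute_patches(patches, seed=0)
```

In an evaluation along these lines, the shuffled sequence would be fed to the model in place of the original, and benchmark scores with and without the permutation would be compared; that comparison step is outside this sketch.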
