Visual Representation Alignment for Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited on vision-centric tasks such as object counting and spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to incorporate complementary visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
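To make the idea of an alignment regularizer concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the similarity measure, the projection, or the loss weight, so the cosine-similarity objective, the linear projection `weight`, the default `lam=0.5`, and all function names here are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def project(hidden, weight):
    """Hypothetical linear projection mapping an MLLM hidden state
    into the VFM feature space; weight has shape (out_dim, in_dim)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def alignment_loss(mllm_visual_states, vfm_features, weight):
    """Assumed objective: 1 minus the mean cosine similarity between
    projected MLLM visual-token states and the matching VFM features."""
    sims = [
        cosine_similarity(project(h, weight), y)
        for h, y in zip(mllm_visual_states, vfm_features)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(lm_loss, align_loss, lam=0.5):
    """Text-supervision loss plus the weighted alignment regularizer.
    The weight lam is an illustrative value, not taken from the source."""
    return lm_loss + lam * align_loss


# Toy usage: two visual tokens with 2-dimensional features.
identity = [[1.0, 0.0], [0.0, 1.0]]
states = [[1.0, 0.0], [0.0, 2.0]]
targets_same = [[2.0, 0.0], [0.0, 1.0]]        # same directions as states
targets_orthogonal = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal to states

loss_same = alignment_loss(states, targets_same, identity)
loss_orthogonal = alignment_loss(states, targets_orthogonal, identity)
```

In this toy setup the loss is close to zero when the projected states point in the same direction as the VFM features and close to one when they are orthogonal, so minimizing the combined objective pushes the internal visual representations toward the VFM's while the language-modeling term is still optimized.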

