多模态大型語言模型的視覺表徵對齊
Visual Representation Alignment for Multimodal Large Language Models
September 9, 2025
作者: Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
cs.AI
Abstract
Multimodal large language models (MLLMs) trained with visual instruction
tuning have achieved strong performance across diverse tasks, yet they remain
limited in vision-centric tasks such as object counting or spatial reasoning.
We attribute this gap to the prevailing text-only supervision paradigm, which
provides only indirect guidance for the visual pathway and often leads MLLMs to
discard fine-grained visual details during training. In this paper, we present
VIsual Representation ALignment (VIRAL), a simple yet effective regularization
strategy that aligns the internal visual representations of MLLMs with those of
pre-trained vision foundation models (VFMs). By explicitly enforcing this
alignment, VIRAL enables the model not only to retain critical visual details
from the input vision encoder but also to incorporate complementary visual
knowledge from VFMs, thereby enhancing its ability to reason over complex
visual inputs. Our experiments demonstrate consistent improvements across all
tasks on widely adopted multimodal benchmarks. Furthermore, we conduct
comprehensive ablation studies to validate the key design choices underlying
our framework. We believe this simple finding opens up an important direction
for the effective integration of visual information in training MLLMs.
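
For intuition, below is a minimal PyTorch-style sketch of what such an alignment regularizer could look like. The abstract does not specify the exact formulation, so the projection head, the cosine-similarity objective, the choice of VFM features, and the names (`VisualAlignmentLoss`, `proj`, `lambda_align`) are illustrative assumptions rather than the paper's actual implementation.

```python
# Sketch of a VIRAL-style alignment regularizer (assumptions: cosine-similarity
# objective, a linear projection head, and frozen VFM patch features; the paper's
# exact loss and layer choice are not given in the abstract).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualAlignmentLoss(nn.Module):
    """Aligns an MLLM's intermediate hidden states at visual-token positions
    with features from a frozen vision foundation model (VFM)."""

    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        # Lightweight projection from the LLM hidden size to the VFM feature size
        # (assumed design; any small head would serve the same role).
        self.proj = nn.Linear(llm_dim, vfm_dim)

    def forward(self, llm_hidden: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, N_vis, llm_dim) hidden states at the visual token positions
        # vfm_feats:  (B, N_vis, vfm_dim) frozen VFM patch features (e.g., from DINOv2),
        #             assumed to be resampled to match the number of visual tokens
        pred = F.normalize(self.proj(llm_hidden), dim=-1)
        target = F.normalize(vfm_feats.detach(), dim=-1)
        # Negative cosine similarity, averaged over visual tokens.
        return (1.0 - (pred * target).sum(dim=-1)).mean()

# Illustrative usage: the regularizer is added to the standard next-token
# prediction loss with a weighting coefficient.
#   total_loss = lm_loss + lambda_align * align_loss(h_visual, vfm_feats)
```

The key design point conveyed by the abstract is that this term supervises the visual pathway directly, rather than relying solely on text-only supervision, so the model is discouraged from discarding fine-grained visual detail during instruction tuning.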