Visual Representation Alignment for Multimodal Large Language Models
September 9, 2025
作者: Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sunghwan Hong, Seungryong Kim
cs.AI
Abstract
Multimodal large language models (MLLMs) trained with visual instruction
tuning have achieved strong performance across diverse tasks, yet they remain
limited in vision-centric tasks such as object counting or spatial reasoning.
We attribute this gap to the prevailing text-only supervision paradigm, which
provides only indirect guidance for the visual pathway and often leads MLLMs to
discard fine-grained visual details during training. In this paper, we present
VIsual Representation ALignment (VIRAL), a simple yet effective regularization
strategy that aligns the internal visual representations of MLLMs with those of
pre-trained vision foundation models (VFMs). By explicitly enforcing this
alignment, VIRAL enables the model not only to retain critical visual details
from the input vision encoder but also to incorporate complementary visual
knowledge from VFMs, thereby enhancing its ability to reason over complex
visual inputs. Our experiments demonstrate consistent improvements across all
tasks on widely adopted multimodal benchmarks. Furthermore, we conduct
comprehensive ablation studies to validate the key design choices underlying
our framework. We believe this simple finding opens up an important direction
for the effective integration of visual information in training MLLMs.
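
To make the idea concrete, below is a minimal sketch of what such an alignment regularizer could look like in PyTorch. The abstract only states that the MLLM's internal visual representations are aligned with features from a pretrained VFM; the specific layer choice, the learned linear projection, the cosine-similarity objective, and names such as `VisualAlignmentRegularizer`, `llm_dim`, `vfm_dim`, and the 0.5 loss weight are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a VIRAL-style visual representation alignment loss.
# Assumption: hidden states at the visual-token positions of an intermediate
# LLM layer are projected and matched to frozen VFM patch features.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAlignmentRegularizer(nn.Module):
    """Aligns MLLM hidden states at visual-token positions with VFM features."""

    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        # Learned projection from the LLM hidden size to the VFM feature size
        # (one plausible design; the paper's exact head is not specified here).
        self.proj = nn.Linear(llm_dim, vfm_dim)

    def forward(self, llm_visual_hidden: torch.Tensor, vfm_features: torch.Tensor) -> torch.Tensor:
        # llm_visual_hidden: (B, N, llm_dim) hidden states at the visual-token positions
        # vfm_features:      (B, N, vfm_dim) frozen VFM patch features for the same image
        pred = self.proj(llm_visual_hidden)
        # Negative mean cosine similarity as the alignment objective.
        return 1.0 - F.cosine_similarity(pred, vfm_features, dim=-1).mean()


if __name__ == "__main__":
    # Toy shapes only; in practice the features come from the MLLM and a frozen VFM.
    B, N, llm_dim, vfm_dim = 2, 196, 4096, 1024
    reg = VisualAlignmentRegularizer(llm_dim, vfm_dim)
    llm_hidden = torch.randn(B, N, llm_dim)   # intermediate LLM layer, visual tokens
    vfm_feats = torch.randn(B, N, vfm_dim)    # e.g., features from a frozen VFM
    lm_loss = torch.tensor(0.0)               # placeholder for the text-supervision loss
    total_loss = lm_loss + 0.5 * reg(llm_hidden, vfm_feats)  # weight is a hyperparameter
    print(total_loss.item())
```

In this reading, the regularizer is simply added to the standard next-token prediction loss during visual instruction tuning, so the text supervision is left untouched while the visual pathway receives direct feature-level guidance.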