Visual Representation Alignment for Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning perform strongly across diverse tasks, yet they remain limited on vision-centric tasks such as object counting and spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which gives the visual pathway only indirect guidance and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to draw on additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
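
The idea of regularizing internal visual representations toward VFM features can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the formulation used in the paper: the abstract does not specify the distance measure, which layer is aligned, how features are projected, or how the term is weighted. The sketch assumes a per-token cosine-similarity penalty, features that already share a dimensionality, and hypothetical names (`alignment_loss`, `total_loss`, `weight`).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(mllm_visual_states, vfm_features):
    """Mean of (1 - cosine similarity) over visual tokens.

    Assumption: each argument is a list with one vector per visual token,
    and both sides already have the same dimensionality. A real system
    would likely need a learned projection between the two spaces.
    """
    if len(mllm_visual_states) != len(vfm_features):
        raise ValueError("expected one VFM feature per visual token")
    per_token = [
        1.0 - cosine_similarity(h, f)
        for h, f in zip(mllm_visual_states, vfm_features)
    ]
    return sum(per_token) / len(per_token)


def total_loss(language_modeling_loss, mllm_visual_states, vfm_features, weight=0.5):
    """Text-supervised loss plus the weighted alignment regularizer.

    `weight` is a hypothetical hyperparameter chosen for illustration.
    """
    return language_modeling_loss + weight * alignment_loss(
        mllm_visual_states, vfm_features
    )


if __name__ == "__main__":
    # Features pointing in the same directions give a loss near zero.
    same_direction = alignment_loss([[1.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [0.0, 1.0]])
    # Orthogonal features give the maximum penalty for non-negative similarity.
    orthogonal = alignment_loss([[1.0, 0.0]], [[0.0, 1.0]])
    print(same_direction, orthogonal)
```

In this sketch the regularizer is scale-invariant, so it penalizes only directional disagreement between the model's visual token states and the VFM features. The text-only objective is left unchanged and simply gains an additive term.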
