Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Abstract

Vision-Language Models (VLMs) combine visual perception with the general capabilities of Large Language Models (LLMs), such as reasoning. However, how these two abilities are combined, and how each contributes, remains poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous work, which often focuses on merging models of the same kind, we propose merging models across modalities, which allows the reasoning capabilities of LLMs to be incorporated into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful, training-free pathway for transferring reasoning abilities from LLMs to VLMs. Moreover, we use the merged models to study the internal mechanisms of perception and reasoning and how merging affects them. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
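The abstract does not state which merging rule is used. As a minimal sketch of what training-free merging of parameters can look like, the snippet below uses plain linear interpolation, a common choice in the merging literature, and assumes the VLM's language backbone and the reasoning LLM share parameter names and shapes (for example, because both were fine-tuned from the same base model). The function name `merge_linear`, the weight `lam`, and the parameter names in the usage example are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of weights.
Params = Dict[str, List[float]]


def merge_linear(vlm: Params, llm: Params, lam: float = 0.9) -> Params:
    """Linearly interpolate parameters shared by both models.

    lam is the weight on the VLM; (1 - lam) is the weight on the LLM.
    Parameters that exist only in the VLM are copied unchanged.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    merged: Params = {}
    for name, w_vlm in vlm.items():
        w_llm = llm.get(name)
        if w_llm is None:
            # No LLM counterpart (e.g. vision encoder or projector weights).
            merged[name] = list(w_vlm)
            continue
        if len(w_llm) != len(w_vlm):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [lam * a + (1.0 - lam) * b for a, b in zip(w_vlm, w_llm)]
    return merged


# Usage with toy weights (illustrative values only).
vlm_params = {"layer0.weight": [1.0, 2.0], "vision.proj": [5.0]}
llm_params = {"layer0.weight": [3.0, 4.0]}
merged_params = merge_linear(vlm_params, llm_params, lam=0.5)
```

Copying VLM-only parameters unchanged is a design choice in this sketch: the visual components have no counterpart in a text-only model, so only the shared language-model weights are interpolated.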
