Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
May 8, 2025
Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
cs.AI
Abstract
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities combine and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanisms of perception and reasoning and how merging affects them. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
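
Below is a minimal sketch of the cross-modal merging idea the abstract describes: linearly interpolating a VLM's language-backbone parameters with those of a reasoning LLM that shares the same base architecture. The helper name merge_language_backbone, the coefficient alpha, and the keep-VLM-weights fallback for modality-specific parameters are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of training-free cross-modal model merging (assumed formulation):
# interpolate parameters shared between a VLM and a reasoning-tuned LLM
# that are built on the same base-LLM architecture.
import torch

def merge_language_backbone(vlm_state, llm_state, alpha=0.5):
    """Return a merged state dict: (1 - alpha) * VLM + alpha * LLM for
    parameters the two models share by name and shape; VLM-only parameters
    (e.g., vision encoder, projector) are kept unchanged."""
    merged = {}
    for name, vlm_param in vlm_state.items():
        llm_param = llm_state.get(name)
        if llm_param is not None and llm_param.shape == vlm_param.shape:
            # Shared language-backbone weight: blend the two models.
            merged[name] = (1.0 - alpha) * vlm_param + alpha * llm_param
        else:
            # Modality-specific weight: keep the VLM's copy as-is.
            merged[name] = vlm_param
    return merged

# Usage (illustrative): load two checkpoints sharing a base LLM, merge, save.
# vlm = torch.load("vlm.pt"); llm = torch.load("llm.pt")
# torch.save(merge_language_backbone(vlm, llm, alpha=0.5), "merged.pt")
```

With alpha = 0 the merged model is the original VLM, and with alpha = 1 its shared layers are replaced by the LLM's; intermediate values trade perception against reasoning, which is what makes the kind of layer-wise analysis described in the abstract possible.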