Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

May 8, 2025
Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
cs.AI

Abstract

Vision-Language Models (VLMs) combine visual perception with the general capabilities of Large Language Models (LLMs), such as reasoning. However, the mechanisms by which these two abilities combine and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous work that largely focuses on merging models of the same kind, we propose merging models across modalities, enabling the reasoning capabilities of LLMs to be incorporated into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway for transferring reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we use the merged models to understand the internal mechanisms of perception and reasoning and how merging affects them. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
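
The abstract does not spell out the exact merging rule, so the following is only a minimal sketch of cross-modal merging under two assumptions: the VLM's language backbone and the reasoning LLM share an architecture (e.g., both fine-tuned from the same base model), and the merge is plain linear interpolation of parameters, one common merging scheme. The function name merge_language_backbones and the coefficient alpha are illustrative, not from the paper.

import torch

def merge_language_backbones(vlm_lm_state, llm_state, alpha=0.5):
    """Linearly interpolate matching parameters:
    (1 - alpha) * VLM weight + alpha * LLM weight."""
    merged = {}
    for name, vlm_param in vlm_lm_state.items():
        llm_param = llm_state.get(name)
        if llm_param is not None and llm_param.shape == vlm_param.shape:
            merged[name] = (1.0 - alpha) * vlm_param + alpha * llm_param.to(vlm_param.dtype)
        else:
            # Tensors without an LLM counterpart (e.g., the vision projector) stay as-is.
            merged[name] = vlm_param
    return merged

# Usage (hypothetical checkpoint files):
# vlm_lm_state = torch.load("vlm_language_model.pt")  # language tower of the VLM
# llm_state = torch.load("reasoning_llm.pt")          # reasoning-tuned LLM, same base
# merged_state = merge_language_backbones(vlm_lm_state, llm_state, alpha=0.5)

Replacing the single alpha with a per-layer coefficient would be a natural way to probe the layer-wise findings above, for instance by mixing LLM weights only into the middle-to-late layers that the paper associates with reasoning.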
