Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging

Abstract

Vision-Language Models (VLMs) combine visual perception with the general capabilities of Large Language Models (LLMs), such as reasoning. However, how these two abilities are combined, and how each contributes, remains poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Whereas previous work has often focused on merging models of the same kind, we propose merging models across modalities, which makes it possible to incorporate the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful, training-free pathway for transferring reasoning abilities from LLMs to VLMs. Moreover, we use the merged models to study the internal mechanisms of perception and reasoning and how merging affects them. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
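
The abstract does not say which merging rule is used, so the following is only a minimal sketch of one common training-free option: linear interpolation of parameters that share a name in both models. The names `merge_parameters`, `Params`, and `alpha`, the parameter keys, and the toy numbers are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical stand-in for a model's named weight tensors (flattened to lists).
Params = Dict[str, List[float]]


def merge_parameters(vlm: Params, llm: Params, alpha: float = 0.5) -> Params:
    """Linearly interpolate the parameters that both models share.

    alpha is the (assumed) weight on the VLM; 1 - alpha goes to the LLM.
    Parameters present only in the VLM (for example a vision projector)
    are copied unchanged, since the LLM has no counterpart for them.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    merged: Params = {}
    for name, vlm_weights in vlm.items():
        llm_weights = llm.get(name)
        if llm_weights is None:
            # No matching LLM parameter: keep the VLM weights as they are.
            merged[name] = list(vlm_weights)
            continue
        if len(llm_weights) != len(vlm_weights):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * v + (1.0 - alpha) * l
            for v, l in zip(vlm_weights, llm_weights)
        ]
    return merged


# Toy example with made-up numbers, only to show the mechanics.
vlm = {"vision.proj": [1.0, 2.0], "text.layer0": [0.0, 4.0]}
llm = {"text.layer0": [2.0, 0.0]}
merged = merge_parameters(vlm, llm, alpha=0.5)
```

In this sketch no gradient step is taken, which is what "training-free" means here: the merged weights are computed directly from the two existing checkpoints. Real models would apply the same idea to full weight tensors rather than short lists.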
