

Training-Free Reasoning and Reflection in MLLMs

May 22, 2025
Authors: Hongchen Wei, Zhenzhong Chen
cs.AI

Abstract

Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that compared to the deeper decoder layers, the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers while preserving visual grounding in shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline InternVL2.5-38B by +5.3, and even surpassing the proprietary GPT-4o model. Our project homepage is at: http://iip.whu.edu.cn/frank/index.html
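The abstract does not spell out the paper's Taylor-derived closed-form fusion coefficients, but the core idea of hierarchical weight merging can be sketched with a simple stand-in: interpolate each decoder layer's weights between the visual-pretrained MLLM and the reasoning-specialized LLM, with a depth-dependent coefficient that favors the visual model in shallow layers and the reasoning model in deep layers. The linear depth schedule below is an illustrative assumption, not the paper's actual formula, and the layer lists are placeholders for per-layer weight tensors.

```python
import numpy as np

def hierarchical_merge(visual_layers, reasoning_layers):
    """Depth-dependent layer-wise merge of two homologous decoder stacks.

    visual_layers / reasoning_layers: lists of per-layer weight arrays with
    matching shapes. A linear schedule alpha_i = i / (L - 1) is used here as a
    stand-in for the paper's closed-form coefficients: shallow layers keep the
    visual model's weights (preserving visual grounding), deep layers take the
    reasoning model's weights.
    """
    L = len(visual_layers)
    merged = []
    for i, (w_v, w_r) in enumerate(zip(visual_layers, reasoning_layers)):
        alpha = i / (L - 1) if L > 1 else 0.5  # 0 at shallowest, 1 at deepest
        merged.append((1.0 - alpha) * w_v + alpha * w_r)
    return merged

# Toy example: 3 "layers" of 2x2 weights, visual model all zeros,
# reasoning model all ones.
visual = [np.zeros((2, 2)) for _ in range(3)]
reasoning = [np.ones((2, 2)) for _ in range(3)]
fused = hierarchical_merge(visual, reasoning)
```

In a real setting the same loop would run over matched entries of the two models' state dicts; the interesting design question, which the paper answers with its Taylor-expansion analysis, is how to choose the per-layer coefficients rather than assuming them linear.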
