Training-Free Reasoning and Reflection in MLLMs

May 22, 2025
Authors: Hongchen Wei, Zhenzhong Chen
cs.AI

Abstract

Recent advances in Reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have showcased impressive reasoning capabilities via reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive costs of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces FRANK, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities, without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that, compared to the deeper decoder layers, the shallow decoder layers allocate more attention to visual tokens, while the deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight-merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers while preserving visual grounding in shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline InternVL2.5-38B by +5.3, and even surpasses the proprietary GPT-4o model. Our project homepage is at: http://iip.whu.edu.cn/frank/index.html.
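
To make the hierarchical merging idea concrete, below is a minimal, illustrative sketch of layer-wise weight interpolation between two decoders with identical architectures. The function name merge_decoder_layers and the linear shallow-to-deep ramp of coefficients are assumptions for illustration only; the paper derives its per-layer fusion coefficients in closed form via a Taylor expansion, not a simple linear schedule.

```python
# Minimal sketch: hierarchical layer-wise weight merging (illustrative only).
import torch

def merge_decoder_layers(mllm_layers, llm_layers, alphas):
    """Interpolate per-layer weights of two same-architecture decoders.

    mllm_layers / llm_layers: one state dict per decoder layer, taken from
    the visual-pretrained MLLM and the reasoning-specialized LLM.
    alphas: per-layer coefficients in [0, 1]; small for shallow layers
    (preserve visual grounding), large for deep layers (inject reasoning).
    """
    merged = []
    for mllm_sd, llm_sd, alpha in zip(mllm_layers, llm_layers, alphas):
        merged.append({
            name: (1.0 - alpha) * mllm_sd[name] + alpha * llm_sd[name]
            for name in mllm_sd
        })
    return merged

# Toy demo: 4 "layers", each a single weight matrix, with a linear
# shallow-to-deep ramp standing in for the paper's closed-form coefficients.
num_layers = 4
mllm = [{"w": torch.randn(8, 8)} for _ in range(num_layers)]
llm = [{"w": torch.randn(8, 8)} for _ in range(num_layers)]
alphas = [i / (num_layers - 1) for i in range(num_layers)]  # 0.0 ... 1.0
fused = merge_decoder_layers(mllm, llm, alphas)
print([round(a, 2) for a in alphas], fused[0]["w"].shape)
```

In this sketch, shallow layers (alpha near 0) stay close to the MLLM's visually grounded weights, while deep layers (alpha near 1) are dominated by the reasoning LLM, mirroring the attention pattern the paper reports across decoder depth.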
