Training-Free Reasoning and Reflection in MLLMs

Abstract

Recent reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have demonstrated impressive reasoning capabilities acquired through reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive cost of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces the FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that shallow decoder layers allocate more attention to visual tokens than deeper decoder layers do, while deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into deep decoder layers while preserving visual grounding in shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline, InternVL2.5-38B, by 5.3 points and even surpassing the proprietary GPT-4o model. Our project homepage is at: http://iip.whu.edu.cn/frank/index.html
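
To make the idea of depth-dependent weight merging concrete, here is a minimal sketch in Python. The abstract does not state the closed-form fusion coefficients, so the linear depth schedule below is purely an illustrative assumption and is not the Taylor-derived mechanism. The names `depth_coefficient` and `merge_layerwise` are hypothetical, and each layer is reduced to a flat list of floats for simplicity.

```python
from typing import List

LayerWeights = List[float]  # hypothetical stand-in for one decoder layer's parameters


def depth_coefficient(layer_idx: int, num_layers: int) -> float:
    """Assumed linear schedule: 0.0 at the shallowest layer, 1.0 at the deepest.

    This is NOT the paper's Taylor-derived coefficient; it only mimics the
    qualitative behaviour described in the abstract.
    """
    if num_layers < 2:
        return 0.0
    return layer_idx / (num_layers - 1)


def merge_layerwise(
    mllm_layers: List[LayerWeights],
    reasoner_layers: List[LayerWeights],
) -> List[LayerWeights]:
    """Interpolate each decoder layer between the MLLM and the reasoning LLM.

    Shallow layers stay close to the MLLM (visual grounding); deep layers
    move toward the reasoning-specialized LLM.
    """
    if len(mllm_layers) != len(reasoner_layers):
        raise ValueError("both models must have the same number of decoder layers")

    num_layers = len(mllm_layers)
    merged: List[LayerWeights] = []
    for idx, (w_vis, w_reason) in enumerate(zip(mllm_layers, reasoner_layers)):
        if len(w_vis) != len(w_reason):
            raise ValueError(f"layer {idx}: parameter shapes differ")
        lam = depth_coefficient(idx, num_layers)
        merged.append([(1.0 - lam) * a + lam * b for a, b in zip(w_vis, w_reason)])
    return merged


if __name__ == "__main__":
    # Toy three-layer "models" with two parameters per layer.
    mllm = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    reasoner = [[3.0, 3.0], [3.0, 3.0], [3.0, 3.0]]
    for idx, layer in enumerate(merge_layerwise(mllm, reasoner)):
        print(idx, layer)
```

In this toy setup the first layer is left identical to the MLLM and the last layer equals the reasoning LLM. A real merge would operate on matching tensors from two checkpoints that share a decoder architecture, and would use the coefficients derived in the paper rather than this assumed schedule.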
