Training-Free Reasoning and Reflection in MLLMs

Abstract

Recent reasoning LLMs (e.g., DeepSeek-R1 and OpenAI-o1) have demonstrated impressive reasoning capabilities acquired through reinforcement learning. However, extending these capabilities to Multimodal LLMs (MLLMs) is hampered by the prohibitive cost of retraining and the scarcity of high-quality, verifiable multimodal reasoning datasets. This paper introduces the FRANK Model, a training-FRee ANd r1-liKe MLLM that imbues off-the-shelf MLLMs with reasoning and reflection abilities without any gradient updates or extra supervision. Our key insight is to decouple perception and reasoning across MLLM decoder layers. Specifically, we observe that shallow decoder layers allocate more attention to visual tokens, whereas deeper decoder layers concentrate on textual semantics. This observation motivates a hierarchical weight-merging approach that combines a visual-pretrained MLLM with a reasoning-specialized LLM. To this end, we propose a layer-wise, Taylor-derived closed-form fusion mechanism that integrates reasoning capacity into the deep decoder layers while preserving visual grounding in the shallow decoder layers. Extensive experiments on challenging multimodal reasoning benchmarks demonstrate the effectiveness of our approach. On the MMMU benchmark, our model FRANK-38B achieves an accuracy of 69.2, outperforming the strongest baseline, InternVL2.5-38B, by 5.3 points and even surpassing the proprietary GPT-4o model. Our project homepage is at: http://iip.whu.edu.cn/frank/index.html
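
To make the idea of layer-wise weight merging concrete, the sketch below interpolates two decoder stacks so that shallow layers stay close to the visual-pretrained MLLM and deep layers move toward the reasoning-specialized LLM. It is an illustration only: the abstract does not give the Taylor-derived closed-form coefficients, so a plain linear depth schedule is swapped in here as an assumption. The names `reasoning_share` and `merge_layerwise`, the list-of-dicts weight layout, and the toy values are all hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: one dict per decoder layer, parameter name -> flat values.
Weights = Dict[str, List[float]]


def reasoning_share(layer: int, num_layers: int) -> float:
    """Assumed linear schedule: 0.0 at the shallowest layer, 1.0 at the deepest.

    This stands in for the paper's Taylor-derived coefficients, which the
    abstract does not specify.
    """
    if num_layers < 2:
        return 0.0
    return layer / (num_layers - 1)


def merge_layerwise(mllm: List[Weights], reasoner: List[Weights]) -> List[Weights]:
    """Interpolate two same-shaped decoder stacks, layer by layer."""
    if len(mllm) != len(reasoner):
        raise ValueError("both models need the same number of decoder layers")
    merged = []
    for i, (vis, rea) in enumerate(zip(mllm, reasoner)):
        lam = reasoning_share(i, len(mllm))
        # Shallow layers (small lam) keep the MLLM's weights for visual grounding;
        # deep layers (large lam) take on the reasoning model's weights.
        merged.append({
            name: [(1.0 - lam) * v + lam * r for v, r in zip(vis[name], rea[name])]
            for name in vis
        })
    return merged


# Toy stacks: three layers, each with a single two-element parameter "w".
mllm_layers = [{"w": [1.0, 1.0]} for _ in range(3)]
reasoner_layers = [{"w": [3.0, 5.0]} for _ in range(3)]
merged_layers = merge_layerwise(mllm_layers, reasoner_layers)
```

In this toy run the first layer is left equal to the MLLM's weights, the last layer equals the reasoning model's weights, and the middle layer is their average. No gradient step is involved, which is the sense in which such merging is training-free.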
