MM-HELIX：ホリスティックプラットフォームと適応型ハイブリッドポリシー最適化によるマルチモーダル長鎖反射的推論の強化

要旨

現在のマルチモーダル大規模言語モデル（MLLM）は、数学や論理などの推論タスクにおいて高い能力を示しているが、複雑な現実世界の問題を解決するために必要な長鎖型の反射的推論能力については、まだ十分に探求されていない。本研究では、まずこの能力を評価するための広範な実証調査を実施する。慎重に設計されたデータ合成エンジンを活用し、反復的思考とバックトラッキングを必要とする42の挑戦的な合成タスクからなる1,260サンプルのマルチモーダルベンチマーク「MM-HELIX」を構築した。このベンチマークでの実証結果から、既存のMLLMが長鎖型の反射的推論において著しい性能不足を示すことが明らかになった。この制約に対処するため、ポストトレーニングデータを生成し、そのデータを活用するための学習パラダイムをさらに探求する。まず、Step-Elicited Response Generationパイプラインを開発し、命令チューニング段階のための10万件の高品質な反射的推論トレースを含む大規模データセット「MM-HELIX-100K」を作成した。標準的な強化学習が、報酬信号の希薄さや教師あり微調整後の致命的な忘却により複雑なタスクで失敗することを考慮し、オフラインの監督とオンラインの最適化を単一の段階に動的に統合する新しいトレーニング戦略「Adaptive Hybrid Policy Optimization（AHPO）」を提案する。この戦略により、モデルは報酬が希薄な場合に専門家データから学習し、熟達した後は独立した探索を行うことができる。Qwen2.5-VL-7Bベースラインに適用した結果、MM-HELIXベンチマークで+18.6%の精度向上を達成し、一般的な数学および論理タスクにおいても+5.7%の平均性能向上を示す強力な汎化能力を実証した。本研究は、MLLMにおける反射的推論が効果的に学習および汎化可能であることを示し、より能力の高いMLLMの開発への道を開くものである。

English

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting 1,260 samples of 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning traces for instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6\% accuracy improvement on MM-HELIX benchmark and demonstrates strong generalization with a +5.7\% average performance gain on general mathematic and logic tasks. Our work demonstrate that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.

MM-HELIX：ホリスティックプラットフォームと適応型ハイブリッドポリシー最適化によるマルチモーダル長鎖反射的推論の強化

MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization

要旨

Support