MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
October 9, 2025
Authors: Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang
cs.AI
Abstract
While current Multimodal Large Language Models (MLLMs) have demonstrated
proficiency in reasoning tasks such as mathematics and logic, their capacity
for long-chain reflective reasoning, a prerequisite for solving complex
real-world problems, remains largely underexplored. In this work, we first
conduct an extensive empirical investigation to evaluate this capability.
Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a
multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks
that require iterative thinking and backtracking. Empirical results on this
benchmark reveal that existing MLLMs exhibit significant performance deficits
in long-chain reflective reasoning. To address this limitation, we generate
post-training data and further explore learning paradigms for exploiting such
data. We first develop the Step-Elicited Response Generation pipeline to create
MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning
traces for the instruction-tuning stage. Given that standard Reinforcement Learning
fails on complex tasks due to sparse reward signals and catastrophic forgetting
after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization
(AHPO), a novel training strategy that dynamically unifies offline supervision
and online optimization into a single stage. This strategy enables the model to
learn from expert data when rewards are sparse and conduct independent
exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our
method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and
demonstrates strong generalization with a +5.7% average performance gain on
general mathematics and logic tasks. Our work demonstrates that reflective
reasoning in MLLMs can be effectively learned and generalized, paving the way
for developing more capable MLLMs.
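
The abstract describes AHPO only at a high level: blend offline supervision with online optimization in a single stage, relying on expert data while rewards are sparse and letting the policy explore on its own once it is proficient. The minimal Python sketch below is one way to picture that adaptive weighting; the names (`ahpo_style_loss`, `success_rate`, `threshold`) and the linear annealing rule are illustrative assumptions, not the paper's actual objective.

```python
# Hedged sketch of the adaptive idea behind AHPO as summarized in the abstract:
# mix an offline (expert/SFT) loss with an online policy-gradient loss, and
# fade out the expert term once the policy's own rollouts start earning reward.
# All names and the annealing schedule are assumptions for illustration only.

import torch


def ahpo_style_loss(sft_loss: torch.Tensor,
                    pg_loss: torch.Tensor,
                    success_rate: float,
                    threshold: float = 0.3) -> torch.Tensor:
    """Blend offline supervision with online optimization in one stage.

    success_rate: fraction of recent rollouts that received a non-zero reward.
    While rewards are sparse (low success rate), lean on the expert SFT loss;
    once the policy is proficient, optimize the RL objective alone.
    """
    # Linearly anneal the expert weight from 1 (no reward signal yet)
    # down to 0 (success rate at or above the proficiency threshold).
    expert_weight = max(0.0, 1.0 - success_rate / threshold)
    return expert_weight * sft_loss + pg_loss


# Toy usage: a sparse-reward regime keeps the expert term active,
# while a proficient policy is driven by the RL objective on its own.
sparse = ahpo_style_loss(torch.tensor(2.0), torch.tensor(0.5), success_rate=0.05)
proficient = ahpo_style_loss(torch.tensor(2.0), torch.tensor(0.5), success_rate=0.6)
print(sparse.item(), proficient.item())  # ~2.17 vs 0.5 (expert term switched off)
```

In this reading, the switch between "learn from expert data" and "independent exploration" is not a separate training phase but a per-batch weighting driven by how often the current policy already solves the task, which matches the abstract's claim that both signals are unified into a single stage.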