MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization
October 9, 2025
作者: Xiangyu Zhao, Junming Lin, Tianhao Liang, Yifan Zhou, Wenhao Chai, Yuzhe Gu, Weiyun Wang, Kai Chen, Gen Luo, Wenwei Zhang, Junchi Yan, Hua Yang, Haodong Duan, Xue Yang
cs.AI
Abstract
While current Multimodal Large Language Models (MLLMs) have demonstrated
proficiency in reasoning tasks such as mathematics and logic, their capacity
for long-chain reflective reasoning, a prerequisite for solving complex
real-world problems, remains largely underexplored. In this work, we first
conduct an extensive empirical investigation to evaluate this capability.
Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a
multimodal benchmark consisting of 1,260 samples spanning 42 challenging synthetic tasks
that require iterative thinking and backtracking. Empirical results on this
benchmark reveal that existing MLLMs exhibit significant performance deficits
in long-chain reflective reasoning. To address this limitation, we generate
post-training data and further explore learning paradigms for exploiting such
data. We first develop the Step-Elicited Response Generation pipeline to create
MM-HELIX-100K, a large-scale dataset of 100k high-quality, reflective reasoning
traces for the instruction-tuning stage. Given that standard Reinforcement Learning
fails on complex tasks due to sparse reward signals and catastrophic forgetting
after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization
(AHPO), a novel training strategy that dynamically unifies offline supervision
and online optimization into a single stage. This strategy enables the model to
learn from expert data when rewards are sparse and conduct independent
exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our
method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and
demonstrates strong generalization with a +5.7% average performance gain on
general mathematics and logic tasks. Our work demonstrates that reflective
reasoning in MLLMs can be effectively learned and generalized, paving the way
for developing more capable MLLMs.
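The abstract describes AHPO as dynamically unifying offline supervision with online optimization: imitate expert traces while rewards are sparse, then explore independently once the policy becomes proficient. The sketch below illustrates that idea only; the function name `ahpo_loss`, the group-relative advantage, and the `sparse_threshold` gate are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of the adaptive hybrid objective suggested by the abstract:
# blend an offline SFT term (expert reflective-reasoning traces) with an online
# policy-gradient term, gated by how much reward the on-policy rollouts earn.
# All names and the specific gating rule are assumptions for illustration.
import torch

def ahpo_loss(rollout_logprobs,        # (G,) log-probs of G on-policy rollouts
              rewards,                 # (G,) scalar rewards for those rollouts
              expert_logprob,          # () log-prob of an expert trace under the policy
              sparse_threshold=0.1):   # assumed cutoff on mean reward for "sparse"
    mean_r = rewards.mean()

    # Online term: group-relative advantages (GRPO-style baseline), detached so
    # they act as fixed weights on the rollout log-probabilities.
    adv = (rewards - mean_r) / (rewards.std() + 1e-6)
    online_loss = -(adv.detach() * rollout_logprobs).mean()

    # Offline term: maximize likelihood of the expert reasoning trace.
    offline_loss = -expert_logprob

    # Adaptive gate: lean on expert supervision when rewards are sparse,
    # switch to independent exploration once the policy succeeds on its own.
    gate = (mean_r < sparse_threshold).float()
    return gate * offline_loss + (1.0 - gate) * online_loss
```

In this reading, the gate keeps the two signals in a single training stage rather than separate SFT and RL phases, which is how the abstract frames AHPO's remedy for sparse rewards and catastrophic forgetting.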