MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization


Abstract

While current Multimodal Large Language Models (MLLMs) have demonstrated proficiency in reasoning tasks such as mathematics and logic, their capacity for long-chain reflective reasoning, a prerequisite for solving complex real-world problems, remains largely underexplored. In this work, we first conduct an extensive empirical investigation to evaluate this capability. Leveraging a carefully designed data synthesis engine, we construct MM-HELIX, a multimodal benchmark consisting of 1,260 samples across 42 challenging synthetic tasks that require iterative thinking and backtracking. Empirical results on this benchmark reveal that existing MLLMs exhibit significant performance deficits in long-chain reflective reasoning. To address this limitation, we generate post-training data and further explore learning paradigms for exploiting such data. We first develop the Step-Elicited Response Generation pipeline to create MM-HELIX-100K, a large-scale dataset of 100k high-quality reflective reasoning traces for the instruction-tuning stage. Given that standard Reinforcement Learning fails on complex tasks due to sparse reward signals and catastrophic forgetting after Supervised Fine-Tuning, we propose Adaptive Hybrid Policy Optimization (AHPO), a novel training strategy that dynamically unifies offline supervision and online optimization into a single stage. This strategy enables the model to learn from expert data when rewards are sparse and to conduct independent exploration once proficient. When applied to the Qwen2.5-VL-7B baseline, our method achieves a +18.6% accuracy improvement on the MM-HELIX benchmark and demonstrates strong generalization with a +5.7% average performance gain on general mathematics and logic tasks. Our work demonstrates that reflective reasoning in MLLMs can be effectively learned and generalized, paving the way for developing more capable MLLMs.
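The abstract describes AHPO only at a high level: expert supervision is used while rewards are sparse, and the model explores on its own once it is proficient. The following is a minimal sketch of one way such a switch could work, not the authors' implementation. The gating rule (count successful rollouts and compare against a minimum), the names `HybridLossConfig`, `expert_weight`, and `hybrid_loss`, and the default values are all assumptions introduced for illustration; the loss terms are plain floats standing in for quantities a real training loop would compute.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class HybridLossConfig:
    # Hypothetical knobs, not taken from the abstract.
    min_successes: int = 2         # rollouts that must succeed before expert data is dropped
    reward_threshold: float = 1.0  # reward at or above this counts as a success


def expert_weight(rewards: Sequence[float], cfg: HybridLossConfig) -> float:
    """Return 1.0 while rewards are sparse, 0.0 once enough rollouts succeed."""
    successes = sum(1 for r in rewards if r >= cfg.reward_threshold)
    return 1.0 if successes < cfg.min_successes else 0.0


def hybrid_loss(on_policy_loss: float,
                expert_nll: float,
                rewards: Sequence[float],
                cfg: HybridLossConfig) -> float:
    """Combine the online (RL) term with an offline (expert) term in a single objective.

    The expert term is switched on only when the sampled rollouts for this
    prompt rarely succeed (assumed gating rule).
    """
    return on_policy_loss + expert_weight(rewards, cfg) * expert_nll


cfg = HybridLossConfig()
# Sparse rewards: only one of four rollouts succeeds, so the expert term is added.
sparse = hybrid_loss(on_policy_loss=0.5, expert_nll=2.0, rewards=[0, 0, 0, 1], cfg=cfg)
# Proficient: three of four rollouts succeed, so only the online term remains.
proficient = hybrid_loss(on_policy_loss=0.5, expert_nll=2.0, rewards=[1, 1, 1, 0], cfg=cfg)
```

A hard 0/1 gate is the simplest choice for this sketch; a smooth weight that decays with the success rate would be another plausible design under the same description.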
