Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning
May 25, 2026
Authors: Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun
cs.AI
Abstract
Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information and lack an explicit mechanism for distinguishing whether integrating an additional source yields information gain or introduces interference. As a result, they struggle to model the dynamic interaction between sources, particularly when the sources differ significantly in physical properties and semantics (e.g., infrared and depth), and can even underperform mono-source reasoning when a single source carries the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization, adaptively emphasizing mutual promotion between sources while suppressing potential noise or conflicts during RLVR. Our theoretical analysis shows that the method quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirically, it yields performance gains of 3.2% and 4.9% on GRPO and DAPO, respectively, across diverse datasets, confirming its effectiveness.
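The abstract does not give the MARS formula, so the following is only an illustrative sketch of what "mono-source rewards as anchors in advantage normalization" could look like. The first function is the usual group-relative normalization used in GRPO-style training. The second function, `mono_anchored_advantages`, is a hypothetical variant written for this example: its choice of anchor (the best mono-source mean reward) and its scale (the pooled standard deviation) are assumptions, not the authors' definitions.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: center on the group mean, scale by the group std."""
    mean, std = fmean(rewards), pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def mono_anchored_advantages(multi_rewards, mono_rewards, eps=1e-6):
    """Hypothetical anchored variant (an assumption, not the paper's formula).

    multi_rewards: rewards of rollouts that saw all sources.
    mono_rewards:  dict mapping a source name to rewards of rollouts
                   that saw only that source.
    """
    # Assumed anchor: the mean reward of the strongest single source.
    anchor = max(fmean(rs) for rs in mono_rewards.values())
    # Assumed scale: std over all rollouts, so the scale is zero only
    # when every reward (and hence every gain) is identical.
    pooled = list(multi_rewards) + [r for rs in mono_rewards.values() for r in rs]
    scale = pstdev(pooled) + eps
    # Positive only when a fused rollout beats the mono-source anchor.
    return [(r - anchor) / scale for r in multi_rewards]


# Toy binary rewards (made up for illustration).
multi = [1.0, 1.0, 0.0, 1.0]
mono = {"infrared": [1.0, 0.0, 0.0, 0.0], "depth": [0.0, 1.0, 1.0, 0.0]}

baseline = group_advantages(multi)
anchored = mono_anchored_advantages(multi, mono)
print(baseline)
print(anchored)
```

Under these assumptions, the baseline advantages always sum to roughly zero regardless of how the fused rollouts compare with single-source ones, whereas the anchored advantages shift upward when fusion beats the best single source and downward when it does not.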