Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

Abstract

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when handling multi-source inputs, existing approaches tend to treat them as a mere accumulation of information and lack an explicit mechanism for distinguishing whether integrating an additional source yields information gain or introduces interference. As a result, they struggle to model the dynamic interaction between sources, particularly sources that differ substantially in physical properties and semantics (e.g., infrared and depth), and they can underperform mono-source reasoning when a single source carries the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization, adaptively emphasizing mutual promotion between sources while suppressing potential noise and conflicts during RLVR. Our theoretical analysis shows that the method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirically, MARS yields average performance gains of 3.2% and 4.9% on GRPO and DAPO, respectively, across diverse datasets, confirming the effectiveness of our method.
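
The abstract does not state the normalization formula, so the following is only an illustrative sketch of the idea and should not be read as the MARS implementation. The first function is the familiar group-relative advantage used in GRPO-style training (reward minus group mean, divided by group standard deviation). The second, `mono_anchored_advantages`, is a hypothetical variant: it assumes the anchor is the best mean reward achieved by any single source and centers the multi-source rewards on that anchor instead of on their own mean. The function names, the choice of `max` as the anchor rule, and the toy rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mono_anchored_advantages(multi_rewards, mono_rewards, eps=1e-6):
    """HYPOTHETICAL anchored variant (an assumption, not the paper's formula).

    multi_rewards: rewards of rollouts that see all sources.
    mono_rewards:  dict mapping a source name to the rewards of rollouts
                   that see only that source.
    The anchor is assumed to be the best mono-source mean reward, so a
    multi-source rollout gets a positive advantage only if it beats what
    the strongest single source already achieves on average.
    """
    anchor = max(mean(rs) for rs in mono_rewards.values())
    sigma = pstdev(multi_rewards)
    return [(r - anchor) / (sigma + eps) for r in multi_rewards]


if __name__ == "__main__":
    # Toy binary (verifiable) rewards for one prompt; values are made up.
    multi = [1.0, 0.0, 1.0, 0.0]
    mono = {"rgb": [1.0, 1.0, 1.0, 0.0], "infrared": [0.0, 0.0, 1.0, 0.0]}
    print("group-relative:", grpo_advantages(multi))
    print("mono-anchored: ", mono_anchored_advantages(multi, mono))
```

In this toy case the RGB-only rollouts succeed more often than the fused rollouts, so the anchored variant shrinks the credit for fused successes and enlarges the penalty for fused failures, which is one way the "interference" case described above could be reflected in the advantage signal.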