Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

May 25, 2026
Authors: Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun
cs.AI

Abstract

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information and lack an explicit mechanism for distinguishing whether integrating an additional source yields information gain or introduces interference. As a result, they struggle to model the dynamic interaction between sources, particularly when the sources differ significantly in physical properties and semantics (e.g., infrared and depth), and they can even underperform mono-source reasoning when a single source holds the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization, adaptively emphasizing mutual promotion between sources while suppressing potential noise or conflicts during RLVR. Our theoretical analysis shows that the method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results across diverse datasets show performance gains of 3.2% and 4.9% on GRPO and DAPO, respectively, confirming the effectiveness of our method.
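
The abstract does not give the MARS formula, so the sketch below only illustrates the general idea of anchoring advantage normalization on a mono-source reward. `group_advantages` is the usual group-relative normalization used in GRPO-style training (reward minus group mean, divided by group standard deviation). `anchored_advantages` is a hypothetical variant written for illustration: it centers the multi-source rewards on an assumed mono-source anchor instead of the group mean, so the sign of each advantage reflects whether fusion beat the mono-source baseline. The function names, the `mono_anchor` parameter, and the choice of centering are assumptions, not the paper's method.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative normalization: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def anchored_advantages(multi_rewards, mono_anchor, eps=1e-6):
    """HYPOTHETICAL illustration, not the MARS formula.

    Centers multi-source rewards on a mono-source anchor (for example, the
    mean reward of rollouts that saw only one source), so the numerator is
    the reward gain or loss attributable to fusing the extra sources.
    """
    sigma = pstdev(multi_rewards)
    return [(r - mono_anchor) / (sigma + eps) for r in multi_rewards]


if __name__ == "__main__":
    multi = [1.0, 0.0, 1.0, 0.0]  # verifiable 0/1 rewards for four rollouts
    print(group_advantages(multi))
    # Anchor of 1.0: the mono-source input already solves the task.
    print(anchored_advantages(multi, mono_anchor=1.0))
    # Anchor of 0.0: the mono-source input fails the task.
    print(anchored_advantages(multi, mono_anchor=0.0))
```

Under this assumed centering, when the mono-source anchor is already 1.0, multi-source rollouts that merely match it receive zero advantage and failing rollouts receive a negative one, which penalizes fusion that hurts a dominant source. When the anchor is 0.0, successful multi-source rollouts receive a positive advantage, which rewards fusion that adds information.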