Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

Abstract

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when handling multi-source inputs, existing approaches tend to treat them as a mere accumulation of information and lack an explicit mechanism for distinguishing whether integrating an additional source yields information gain or introduces interference. As a result, they struggle to model the dynamic interaction between sources, particularly sources that differ significantly in physical properties and semantics (e.g., infrared and depth), and they can perform worse than mono-source reasoning when one source carries the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization, adaptively emphasizing mutual promotion between sources while suppressing potential noise or conflicts during RLVR. Our theoretical analysis shows that the method effectively quantifies the information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results show performance gains of 3.2% and 4.9% on GRPO and DAPO, respectively, across diverse datasets, confirming the effectiveness of our method.
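To make the idea of anchoring advantage normalization on mono-source rewards concrete, the following is a minimal sketch. It is not the MARS formulation, which the abstract does not specify. The function `grpo_advantages` follows the usual group-relative normalization (reward minus group mean, divided by group standard deviation). The function `mono_anchored_advantages`, its choice of the best mono-source mean as the anchor, and the source names `rgb` and `infrared` are illustrative assumptions only.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center on the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mono_anchored_advantages(multi_rewards, mono_rewards, eps=1e-6):
    """Hypothetical anchored variant (an assumption, not the paper's formula).

    The baseline is the best mean reward achieved by any single source, so a
    multi-source rollout gets a positive advantage only when it beats what the
    strongest single source already achieves.
    """
    # Anchor: highest mean reward among the mono-source rollout groups.
    anchor = max(mean(rs) for rs in mono_rewards.values())
    sigma = pstdev(multi_rewards)
    return [(r - anchor) / (sigma + eps) for r in multi_rewards]


if __name__ == "__main__":
    # Toy verifiable rewards (1.0 = correct answer, 0.0 = incorrect).
    multi = [1.0, 0.0, 1.0, 0.0]
    mono = {
        "rgb": [1.0, 1.0, 1.0, 0.0],       # hypothetical dominant source
        "infrared": [0.0, 0.0, 1.0, 0.0],  # hypothetical weaker source
    }
    print("group-relative:", grpo_advantages(multi))
    print("mono-anchored: ", mono_anchored_advantages(multi, mono))
```

In this toy case the fused input solves fewer prompts than the dominant single source, so the anchored advantages are shifted downward relative to the group-relative ones. That is one simple way in which interference from an extra source could be penalized rather than averaged away.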