Perception-Aware Policy Optimization for Multimodal Reasoning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance on multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of Group Relative Policy Optimization (GRPO) that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we add the Implicit Perception Loss, a KL divergence term, to the GRPO objective. Despite its simplicity, this term yields a significant overall improvement (4.4%) on diverse multimodal benchmarks. The improvement is more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct a comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work integrates perception-aware supervision more deeply into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
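
As a rough illustration of what "a KL divergence term added to the GRPO objective" could look like, the following is a minimal sketch and not the authors' implementation. The abstract does not specify which two distributions are compared, the sign of the term, or its weight. The sketch assumes the KL is taken between the policy's next-token distributions computed with the original image and with a corrupted (for example, masked) image, and that a larger divergence is rewarded. The function names, the toy logits, and the weight `gamma` are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def implicit_perception_term(logits_full, logits_masked):
    """Mean per-token KL between two sets of next-token logits.

    ASSUMPTION: logits_full come from the policy given the original image,
    logits_masked from the same policy given a corrupted image. Each argument
    is a list (one entry per generated token) of per-vocabulary logit lists.
    """
    kls = [
        kl_divergence(softmax(full), softmax(masked))
        for full, masked in zip(logits_full, logits_masked)
    ]
    return sum(kls) / len(kls)


def papo_style_objective(grpo_surrogate, perception_kl, gamma=0.01):
    """Objective to maximize: a GRPO surrogate value plus a weighted KL term.

    ASSUMPTION: the KL term is added with a positive weight so that outputs
    which depend on the visual input are encouraged; gamma is hypothetical.
    """
    return grpo_surrogate + gamma * perception_kl


# Toy example: two generated tokens over a three-word vocabulary.
full = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
masked = [[0.7, 0.6, 0.5], [0.4, 0.5, 0.6]]
kl_term = implicit_perception_term(full, masked)
objective = papo_style_objective(grpo_surrogate=0.5, perception_kl=kl_term)
```

Under these assumptions, the term is zero when masking the image leaves the model's predictions unchanged and grows as the predictions come to depend on the visual input. Because such a term can be increased without bound, some form of regularization is plausible. The abstract names a Double Entropy Loss for this purpose but does not define it, so it is not sketched here.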
