Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought

Abstract

Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity: they treat all CoT tokens uniformly and do not distinguish their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show that successful reasoning is characterized by structured token dynamics reflecting both perceptual grounding and exploratory inference. Building on this analysis, we propose Perception-Exploration Policy Optimization (PEPO), which derives a perception prior from hidden-state similarity and integrates it with token entropy through a smooth gating mechanism to produce token-level advantages. PEPO integrates with existing RLVR frameworks such as GRPO and DAPO and requires neither additional supervision nor auxiliary branches. Experiments on multimodal benchmarks spanning geometry reasoning, visual grounding, visual puzzle solving, and few-shot classification show consistent improvements over strong RL baselines while maintaining stable training dynamics. Code: https://github.com/xzxxntxdy/PEPO
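The abstract does not give formulas, so the following is only an illustrative sketch of the general idea, not the PEPO implementation. It assumes (hypothetically) that the perception prior is the maximum cosine similarity between a response token's hidden state and the visual-token hidden states, that exploration is measured by the entropy of the next-token distribution, and that a sigmoid gate blends the two into a per-token weight that rescales a sequence-level advantage such as the one GRPO produces. All function names, the `gate_sharpness` parameter, and the normalization choices are assumptions made for illustration.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0


def min_max(values):
    """Rescale values to [0, 1]; constant inputs map to 0.5."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(x - lo) / (hi - lo) for x in values]


def token_level_advantages(seq_advantage, token_logits, token_hidden,
                           visual_hidden, gate_sharpness=4.0):
    """Spread one sequence-level advantage over tokens (hypothetical scheme).

    Assumption: tokens that look visually grounded are weighted by their
    perception score, other tokens by their entropy, with a sigmoid gate
    interpolating smoothly between the two.
    """
    perception = min_max(
        [max(cosine(h, v) for v in visual_hidden) for h in token_hidden]
    )
    exploration = min_max([token_entropy(l) for l in token_logits])

    weights = []
    for p, e in zip(perception, exploration):
        gate = 1.0 / (1.0 + math.exp(-gate_sharpness * (p - 0.5)))
        weights.append(gate * p + (1.0 - gate) * e)

    # Normalize so the mean weight is 1, which keeps the average token
    # advantage equal to the original sequence-level advantage.
    mean_w = sum(weights) / len(weights)
    if mean_w == 0.0:
        return [seq_advantage] * len(weights)
    return [seq_advantage * w / mean_w for w in weights]


# Toy inputs: three response tokens, two visual tokens, 2-d hidden states.
logits = [[5.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 1.0, 0.0]]
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[1.0, 0.0], [0.9, 0.1]]
advantages = token_level_advantages(1.0, logits, hidden, visual)
```

Normalizing the weights to a mean of 1 is a design choice of this sketch: it changes how credit is distributed across tokens without changing the overall scale of the update, which is one way a token-level scheme could plug into a GRPO-style objective without extra supervision.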
