ChatPaper.ai

Spotlight on Token Perception for Multimodal Reinforcement Learning

October 10, 2025
Authors: Siyuan Huang, Xiaoye Qu, Yafu Li, Yun Luo, Zefeng He, Daizong Liu, Yu Cheng
cs.AI

Abstract

While Reinforcement Learning with Verifiable Rewards (RLVR) has advanced the reasoning capabilities of Large Vision-Language Models (LVLMs), most existing methods in multimodal reasoning neglect the critical role of visual perception within the RLVR optimization process. In this paper, we undertake a pioneering exploration of multimodal RLVR through the novel perspective of token perception, which measures the visual dependency of each generated token. With a granular analysis of Chain-of-Thought (CoT) processes, we uncover two key insights: first, token perception in a rollout trajectory is sparsely distributed, where only a small fraction of tokens have high visual dependency for visually-grounded reasoning; second, different trajectories exhibit significant divergence in their overall visual dependency. Based on these observations, we propose Visually-Perceptive Policy Optimization (VPPO), a novel policy gradient algorithm that explicitly leverages token perception to refine the learning signal. Specifically, VPPO achieves this through a dual mechanism: it reweights a trajectory's advantage by its overall visual dependency, and focuses policy updates exclusively on perceptually pivotal tokens. On a comprehensive suite of eight perception and reasoning benchmarks, VPPO demonstrates substantial gains over leading open-source RL-tuned models, with its effectiveness consistently validated across 7B and 32B model scales. Our findings not only establish a new token-level perceptual perspective for analyzing multimodal RLVR but also present a novel and effective optimization strategy to significantly enhance the multimodal reasoning capabilities of LVLMs.
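The abstract describes VPPO's dual mechanism: reweighting a trajectory's advantage by its overall visual dependency, and restricting policy updates to the perceptually pivotal tokens. A minimal sketch of that idea, assuming per-token visual-dependency scores are already available (all function and parameter names here are illustrative assumptions, not the paper's implementation; the token selection rule and clipping follow a standard PPO-style surrogate):

```python
import math

def vppo_token_loss(logp_new, logp_old, advantage, vis_dep,
                    top_frac=0.5, clip_eps=0.2):
    """Illustrative sketch of VPPO's dual mechanism:
      (1) reweight the trajectory-level advantage by overall visual dependency;
      (2) update only the most visually dependent ("pivotal") tokens.
    logp_new/logp_old: per-token log-probs under new/old policy.
    advantage: scalar trajectory advantage (e.g. from a verifiable reward).
    vis_dep: per-token visual-dependency scores in [0, 1].
    """
    # (1) Trajectory reweighting: scale the shared advantage by the
    # trajectory's mean token-level visual dependency.
    traj_weight = sum(vis_dep) / len(vis_dep)
    adv = advantage * traj_weight

    # (2) Token selection: keep only the top fraction of tokens by visual
    # dependency; the remaining tokens contribute no update signal.
    k = max(1, int(top_frac * len(vis_dep)))
    pivotal = sorted(range(len(vis_dep)), key=lambda i: vis_dep[i])[-k:]

    # Standard clipped PPO surrogate, averaged over pivotal tokens only.
    losses = []
    for i in pivotal:
        ratio = math.exp(logp_new[i] - logp_old[i])
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

Under this sketch, a trajectory whose reasoning rarely attends to the image (low mean `vis_dep`) has its advantage attenuated, and tokens with low visual dependency are excluded from the update entirely, which mirrors the two observations the authors report: sparse token-level perception within a trajectory and divergent overall visual dependency across trajectories.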
PDF · October 14, 2025