

RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

October 11, 2025
Authors: Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving the reasoning abilities of Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from latent space and propose RLFR, in which the flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection-sampling data, and the velocity deviations of policy latents within these fields are quantified to serve as a reward signal. RLFR demonstrates for the first time that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains largely underexplored. Moreover, RLFR can compress any off-policy expert data as a reference for constituting reward signals, and we show that it exploits the efficient context dependence compressed within the hidden states, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
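To make the reward mechanism concrete, here is a minimal, hypothetical sketch in PyTorch (not the authors' released implementation): a small velocity network is fit on reference hidden states with conditional flow matching over a straight-line interpolant, and the per-token velocity deviation of policy hidden states under that field is mapped to a shaping reward. The names (`VelocityField`, `flow_matching_loss`, `flow_reward`) and all hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch: fit a flow field on reference latents, then score policy
# latents by how well they conform to it (small velocity deviation -> high reward).
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Small MLP predicting the flow velocity v(x_t, t) over hidden states."""
    def __init__(self, dim: int, width: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, width), nn.SiLU(),
            nn.Linear(width, width), nn.SiLU(),
            nn.Linear(width, dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: [N, dim], t: [N, 1]
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(v_net: VelocityField, h_ref: torch.Tensor) -> torch.Tensor:
    """Conditional flow-matching objective on reference latents h_ref [N, dim]."""
    x0 = torch.randn_like(h_ref)                    # noise endpoint
    t = torch.rand(h_ref.size(0), 1, device=h_ref.device)
    x_t = (1 - t) * x0 + t * h_ref                  # straight-line interpolant
    target_v = h_ref - x0                           # conditional target velocity
    return ((v_net(x_t, t) - target_v) ** 2).mean()

@torch.no_grad()
def flow_reward(v_net: VelocityField, h_policy: torch.Tensor,
                n_samples: int = 4) -> torch.Tensor:
    """Per-token reward for policy latents h_policy [T, dim]: negative mean
    velocity deviation under the reference flow field, averaged over noise/time."""
    devs = []
    for _ in range(n_samples):
        x0 = torch.randn_like(h_policy)
        t = torch.rand(h_policy.size(0), 1, device=h_policy.device)
        x_t = (1 - t) * x0 + t * h_policy
        target_v = h_policy - x0
        devs.append(((v_net(x_t, t) - target_v) ** 2).mean(dim=-1))
    return -torch.stack(devs).mean(dim=0)           # higher = closer to the field
```

In a full RLVR loop, such a per-token flow reward would be combined with the verifiable outcome reward; the sketch deliberately omits choices the paper specifies, such as which layer's hidden states are used and how deviations are normalized.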