RLFR: Extending Reinforcement Learning for LLMs with Flow Environment
October 11, 2025
Authors: Jinghao Zhang, Naishan Zheng, Ruilin Li, Dongzhou Cheng, Zheming Liang, Feng Zhao, Jiaqi Wang
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving the reasoning abilities of Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration within reasoning trajectories. In view of the heavy annotation cost of gold Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from the logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from the latent space, and propose RLFR, in which flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection-sampling data, and the velocity deviations of policy latents within these fields are quantified to serve as a reward signal. RLFR demonstrates for the first time that a well-established flow field can be a sound environment for collecting reward signals, highlighting that the expressive latent space remains largely underexplored. Moreover, RLFR can compress any off-policy expert data as a reference for constituting reward signals, and we show that the efficient context dependence compressed within the hidden states is utilized, rather than individual token-level denotations, for context comprehension. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
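To make the mechanism described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of the idea: fit a flow field over reference hidden states drawn from off-policy expert data and/or on-policy rejection-sampled data, then score policy hidden states by their velocity deviation inside that field. The class and function names (`FlowField`, `flow_matching_loss`, `flow_reward`) and the linear noise-to-data (rectified-flow-style) path are assumptions made for illustration only.

```python
# Illustrative sketch, assuming a rectified-flow-style conditional flow-matching
# objective over LLM hidden states; not the paper's exact formulation.
import torch
import torch.nn as nn

class FlowField(nn.Module):
    """Small MLP velocity field v_theta(x_t, t) over LLM hidden states."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + 1, 4 * hidden_dim), nn.SiLU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # x_t: (N, D) interpolated latents, t: (N, 1) times in [0, 1]
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(field: FlowField, ref_latents: torch.Tensor) -> torch.Tensor:
    """Fit the field on reference latents (off-policy expert data and/or
    on-policy rejection-sampled data) along a linear noise-to-data path."""
    noise = torch.randn_like(ref_latents)
    t = torch.rand(ref_latents.size(0), 1, device=ref_latents.device)
    x_t = (1.0 - t) * noise + t * ref_latents   # linear interpolation path
    target_v = ref_latents - noise              # conditional target velocity
    pred_v = field(x_t, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def flow_reward(field: FlowField, policy_latents: torch.Tensor,
                n_samples: int = 8) -> torch.Tensor:
    """Per-token reward: negative expected velocity deviation of policy
    latents inside the learned field (higher = closer to the reference field)."""
    rewards = torch.zeros(policy_latents.size(0), device=policy_latents.device)
    for _ in range(n_samples):
        noise = torch.randn_like(policy_latents)
        t = torch.rand(policy_latents.size(0), 1, device=policy_latents.device)
        x_t = (1.0 - t) * noise + t * policy_latents
        target_v = policy_latents - noise
        pred_v = field(x_t, t)
        rewards -= ((pred_v - target_v) ** 2).mean(dim=-1)
    return rewards / n_samples
```

Under this reading, the per-token flow reward would be added to the verifiable outcome reward during RLVR-style policy optimization; how the two are weighted and normalized is left unspecified here, as the abstract does not detail it.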