

Data-Efficient RLVR via Off-Policy Influence Guidance

October 30, 2025
Authors: Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang
cs.AI

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
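The abstract names two concrete ingredients: gradient-based influence scores and sparse random projection to make those scores tractable at LLM scale. The paper's code is not reproduced here, so the following is a minimal illustrative sketch only, assuming a first-order influence proxy (the inner product between a candidate example's gradient and a target-objective gradient estimated from offline trajectories) and an Achlioptas-style sparse sign projection; all names, dimensions, and the selection rule below are hypothetical stand-ins, not CROPI's actual implementation.

```python
import numpy as np

# Illustrative sketch (NOT the paper's released code). Assumptions:
#  - influence is approximated first-order, as the inner product between a
#    candidate example's gradient and a target-objective gradient estimated
#    from pre-collected offline trajectories;
#  - gradients are compressed with a sparse random projection so they can be
#    stored and compared cheaply. All names and sizes here are hypothetical.

def sparse_random_projection(dim_in: int, dim_out: int,
                             density: float = 0.01, seed: int = 0) -> np.ndarray:
    """Sparse sign matrix: each entry is +/-1 with probability `density`,
    else 0, scaled so projected inner products are unbiased estimates."""
    rng = np.random.default_rng(seed)
    mask = rng.random((dim_in, dim_out)) < density
    signs = rng.choice([-1.0, 1.0], size=(dim_in, dim_out))
    return mask * signs / np.sqrt(density * dim_out)

# Toy stand-ins: in practice dim_in would be the model's parameter count and
# the vectors would be real per-example and off-policy target gradients.
dim_in, dim_out = 10_000, 256
P = sparse_random_projection(dim_in, dim_out)

g_target = np.random.default_rng(1).normal(size=dim_in)           # target gradient
cand_grads = np.random.default_rng(2).normal(size=(100, dim_in))  # per-example gradients

# Project once, then score every candidate with a cheap low-dimensional dot product.
pg_target = g_target @ P               # shape: (dim_out,)
scores = (cand_grads @ P) @ pg_target  # shape: (n_candidates,)

# Curriculum-style selection: keep the top 10% most influential examples per stage.
keep = np.argsort(scores)[::-1][: len(scores) // 10]
print("selected indices:", keep)
```

The projection step is what makes the bookkeeping feasible: per-example LLM gradients have billions of coordinates, while a Johnson-Lindenstrauss-style sparse projection preserves inner products in expectation at a few hundred dimensions, so projected gradients can be cached and rescored against each stage's current policy.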