The Unlearnability Phenomenon of RLVR in Language Models

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand this phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. Using cross-example gradient analysis, we then show that unlearnable examples suffer from a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and by reasoning patterns that do not generalize. We further show that these representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations of current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.
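To make the idea of cross-example gradient similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes that similarity is measured as the cosine between one example's gradient and the mean gradient of all other examples; the abstract does not specify the exact metric. The function names and the toy gradient values are hypothetical and chosen only for illustration.

```python
import numpy as np


def cosine(u, v, eps=1e-12):
    # Cosine similarity between two flat gradient vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def cross_example_similarity(grads):
    """For each example, compare its gradient with the mean gradient
    of all *other* examples (a leave-one-out comparison).

    grads: array of shape (n_examples, n_params), one flattened
           per-example gradient per row (assumed to be precomputed).
    """
    grads = np.asarray(grads, dtype=float)
    n = grads.shape[0]
    total = grads.sum(axis=0)
    sims = []
    for i in range(n):
        rest_mean = (total - grads[i]) / (n - 1)
        sims.append(cosine(grads[i], rest_mean))
    return np.array(sims)


# Hypothetical toy gradients: the first three examples point in a
# similar direction, the last one points almost orthogonally.
toy_grads = np.array([
    [1.0, 0.1],
    [0.9, 0.0],
    [1.0, -0.1],
    [-0.2, 1.0],
])

sims = cross_example_similarity(toy_grads)
outlier = int(np.argmin(sims))  # index of the least-aligned example
print(sims, outlier)
```

In this toy setting the last example has the lowest similarity to the rest, which is the kind of signal the abstract associates with unlearnable examples. With a real model, each row would be a flattened per-example policy gradient rather than a hand-written vector.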