The Unlearnability Phenomenon in RLVR for Language Models

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand this phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. Using cross-example gradient analysis, we show that unlearnable examples suffer from a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and by reasoning patterns that do not generalize. We further show that such representation flaws are difficult to mitigate during RL training, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations of current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.
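
As an illustration of the cross-example gradient analysis mentioned above, the following minimal Python sketch scores each example by the mean cosine similarity between its gradient and the gradients of all other examples. The abstract does not specify the exact metric, so the function names (`cosine`, `mean_similarity_to_rest`), the use of flattened per-example gradient vectors, and the toy values are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_to_rest(grads):
    """For each example i, average the cosine similarity between its
    gradient and the gradient of every other example j != i."""
    n = len(grads)
    if n < 2:
        raise ValueError("need at least two examples")
    scores = []
    for i in range(n):
        total = sum(cosine(grads[i], grads[j]) for j in range(n) if j != i)
        scores.append(total / (n - 1))
    return scores


# Hypothetical flattened per-example gradients (toy values, not real data).
toy_grads = [
    [1.0, 0.0],
    [1.0, 0.1],
    [0.9, -0.1],
    [0.0, 1.0],  # points in a different direction from the other three
]

scores = mean_similarity_to_rest(toy_grads)
# Index of the example whose gradient agrees least with the rest.
outlier = min(range(len(scores)), key=scores.__getitem__)
```

Under this assumed metric, an example whose score falls well below the others would be a candidate for the low-gradient-similarity behavior described in the abstract.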