The Unlearnability Phenomenon in RLVR for Language Models

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective at improving the reasoning ability of large language models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand this phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. Using cross-example gradient analysis, we then show that unlearnable examples suffer from a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and by reasoning patterns that do not generalize. We further show that these representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations of current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.
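To make the notion of cross-example gradient similarity concrete, the following is a minimal sketch and not the authors' implementation. It assumes a toy logistic-regression model in place of an LLM policy and a supervised log-loss in place of the RLVR objective; the function names (per_example_grad, similarity_to_rest) and the toy data are hypothetical. For each example it computes the cosine similarity between that example's gradient and the mean gradient of all other examples, so an example whose update direction conflicts with the rest receives a low score.

```python
import math

def per_example_grad(w, x, y):
    """Gradient of the logistic loss for one (x, y) pair with respect to w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * xi for xi in x]

def cosine(a, b):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def similarity_to_rest(grads):
    """For each gradient, cosine similarity with the mean gradient of all others."""
    sims = []
    for i, g in enumerate(grads):
        rest = [h for j, h in enumerate(grads) if j != i]
        mean_rest = [sum(col) / len(rest) for col in zip(*rest)]
        sims.append(cosine(g, mean_rest))
    return sims

# Hypothetical toy data: three mutually consistent examples and one whose
# label conflicts with them (a stand-in for an "unlearnable" example).
w = [0.0, 0.0]
examples = [
    ([1.0, 0.2], 1),
    ([0.9, 0.1], 1),
    ([1.1, 0.3], 1),
    ([1.0, 0.2], 0),  # same features as the first example, opposite label
]

grads = [per_example_grad(w, x, y) for x, y in examples]
sims = similarity_to_rest(grads)

for (x, y), s in zip(examples, sims):
    print(f"x={x} y={y} similarity_to_rest={s:+.3f}")
```

In this toy setting the conflicting example's gradient points away from the mean gradient of the others, while the consistent examples align with it. In an actual RLVR setup the per-example gradients would presumably come from the policy-gradient loss over rollouts rather than from a supervised loss; that substitution is an assumption of this sketch.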