The Unlearnability Phenomenon in RLVR for Language Models

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) has proven effective in improving the reasoning ability of Large Language Models (LLMs). However, the learning dynamics of RLVR remain underexplored. In this paper, we reveal a counterintuitive phenomenon: among hard examples that the model initially struggles with, a substantial subset remains unlearnable even when correct rollouts are present. To understand this phenomenon, we first demonstrate that existing optimization and sampling techniques fail to resolve unlearnability. Using cross-example gradient analysis, we show that unlearnable examples have a fundamental representation issue, characterized by low gradient similarity with the rest of the examples and by ungeneralizable reasoning patterns. We further show that representation flaws are difficult to mitigate in RL, as data augmentation does not improve gradient similarity. Our study provides the first systematic characterization of unlearnable data in RLVR training and reveals fundamental limitations in current RL approaches for reasoning tasks. Code and data are available at https://github.com/yulinchen99/unlearnability-rlvr.