Rethinking Reinforcement Learning for LLM Reasoning: Sparse Policy Selection, Not Capability Learning

Abstract

Reinforcement learning has become the standard technique for improving reasoning in large language models, yet evidence increasingly suggests that RL does not teach new strategies; it redistributes probability mass over solutions the base model already contains. In this work, we ask: if RL merely steers the model toward paths it already knows, is the RL optimization loop itself necessary? Through token-level analysis across multiple model families and RL algorithms, we find that RL's beneficial footprint is a sparse, predictable correction concentrated at high-entropy decision points where the model is uncertain which branch to take. Only 1-3% of token positions are affected, the promoted token always lies within the base model's top-5 alternatives, and targeted corrections at those few positions causally recover a large fraction of RL's accuracy gain, while random corrections fail. The base model's own entropy identifies these positions without any RL-trained model, and the entire correction is low-dimensional, representable in a tiny fraction of model parameters. These findings reframe reasoning improvement as sparse policy selection, not capability acquisition. We translate this insight into ReasonMaxxer, a minimal RL-free method that applies a contrastive loss only at entropy-gated decision points, using a few hundred base-model rollouts and no online generation. Across three model families, six scales, and six math reasoning benchmarks, ReasonMaxxer matches or exceeds full RL performance while requiring only tens of problems and minutes of single-GPU training, a reduction in training cost of roughly three orders of magnitude.
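
To make the notion of an entropy-gated decision point concrete, the following is a minimal sketch, not the paper's implementation. It assumes the base model's next-token distribution at each position is available as a list of probabilities; the function names, the 3% selection fraction, and the toy distributions are all hypothetical and chosen only for illustration. The sketch ranks positions by Shannon entropy, keeps the highest-entropy few, and checks whether a candidate token lies in the base model's top-k alternatives.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(dists, fraction=0.03):
    """Indices of the top `fraction` of positions, ranked by entropy.

    `fraction` is an illustrative assumption, loosely echoing the 1-3%
    of positions mentioned in the abstract; at least one position is kept.
    """
    entropies = [token_entropy(d) for d in dists]
    k = max(1, round(fraction * len(dists)))
    ranked = sorted(range(len(dists)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


def in_top_k(probs, token_id, k=5):
    """True if `token_id` is among the k most probable tokens."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return token_id in top


# Toy example: four positions over a four-token vocabulary (made-up numbers).
dists = [
    [0.97, 0.01, 0.01, 0.01],      # confident
    [0.40, 0.30, 0.20, 0.10],      # uncertain: a candidate decision point
    [0.90, 0.05, 0.03, 0.02],      # confident
    [0.99, 0.005, 0.003, 0.002],   # confident
]
gated = high_entropy_positions(dists, fraction=0.25)
```

In this toy input only the second position is selected, since it is the only one where the distribution is spread across several tokens. A training loss restricted to such positions would then be applied only there; how that contrastive loss is defined is not specified in the abstract and is not sketched here.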