Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning

Abstract

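Reinforcement learning with verifiable rewards (RLVR) has greatly improved the reasoning abilities of large language models (LLMs), but training often stalls once problems become saturated. We identify the core difficulty as the low accessibility of informative failures: the learning signal exists, but it rarely occurs during standard rollouts. To address this, we propose failure-prefix conditioning, a simple yet effective method that makes learning possible even on saturated problems. Instead of starting from the original question, it reallocates training by conditioning on prefixes derived from rare incorrect reasoning trajectories, so that the model is exposed to failure-prone states. We observe that failure-prefix conditioning yields performance gains on par with training on medium-difficulty problems while preserving token efficiency. Analyzing the model's robustness further, we find that our method reduces performance degradation under misleading failure prefixes, at the cost of slightly weaker adherence to correct early reasoning. Finally, we show that an iterative approach that refreshes the failure prefixes during training produces additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective path for extending RLVR training to saturated problems.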

English

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning abilities of large language models (LLMs), yet training often stalls as problems become saturated. We identify the core challenge as the poor accessibility of informative failures: learning signals exist but are rarely encountered during standard rollouts. To address this, we propose failure-prefix conditioning, a simple and effective method for learning from saturated problems. Rather than starting from the original question, our approach reallocates exploration by conditioning training on prefixes derived from rare incorrect reasoning trajectories, thereby exposing the model to failure-prone states. We observe that failure-prefix conditioning yields performance gains matching those of training on medium-difficulty problems, while preserving token efficiency. Furthermore, we analyze the model's robustness, finding that our method reduces performance degradation under misleading failure prefixes, albeit with a mild trade-off in adherence to correct early reasoning. Finally, we demonstrate that an iterative approach, which refreshes failure prefixes during training, unlocks additional gains after performance plateaus. Overall, our results suggest that failure-prefix conditioning offers an effective pathway to extend RLVR training on saturated problems.
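
The conditioning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how prefixes are selected or truncated, so the `Rollout` type, the `failure_prefix_prompts` helper, and the `fraction` cut-off are all hypothetical names and assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    tokens: List[str]  # reasoning trace, assumed to be already tokenized
    correct: bool      # outcome reported by the verifiable reward


def failure_prefix_prompts(question: str,
                           rollouts: List[Rollout],
                           fraction: float = 0.5) -> List[str]:
    """Build prompts that resume from the start of each incorrect rollout.

    `fraction` is an assumed knob: the share of a failed trace kept as prefix.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    prompts = []
    for rollout in rollouts:
        # Correct (or empty) rollouts provide no failure prefix; skip them.
        if rollout.correct or not rollout.tokens:
            continue
        # Keep at least one token so the prompt differs from the bare question.
        cut = max(1, int(len(rollout.tokens) * fraction))
        prefix = " ".join(rollout.tokens[:cut])
        prompts.append(f"{question}\n{prefix}")
    return prompts
```

Under these assumptions, the returned prompts would replace the bare question as the starting point for further rollouts, so that sampling begins from states where the model has already been observed to fail.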

Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning