DenoiseRL: Bootstrapping Reasoning Models for Noisy Prefix Recovery

Abstract

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated sets of difficult problems, which limits how far capability gains can scale. In this paper, we introduce DenoiseRL, a reinforcement learning framework that replaces external supervision with recovery-oriented optimization over the failures of weak models. Rather than relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces, treating each one as an opportunity for improvement, which makes training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal and makes exploration from imperfect model behavior more efficient. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, offering an effective and scalable alternative path to improving reasoning in large language models.