From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

Abstract

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems for two reasons: rollouts with a correct final answer are rare, and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
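The subproblem-level normalization described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the GRPO-style mean/std normalization, the `eps` constant, the half-open token spans, and the choice to give tokens outside any answer span zero advantage are all assumptions made for illustration. The sketch takes a group of rollouts with one binary reward per subproblem position (the last position being the original problem), normalizes each position independently across the group, and broadcasts the result onto the tokens of the matching answer span.

```python
import numpy as np


def subproblem_advantages(rewards, eps=1e-6):
    """Normalize rewards independently at each subproblem position.

    rewards: array of shape (G, K) for G rollouts and K subproblem positions;
    by assumption the last column is the reward on the original problem.
    Returns an array of shape (G, K) of per-position advantages.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=0, keepdims=True)  # per-position group mean
    std = rewards.std(axis=0, keepdims=True)    # per-position group std
    return (rewards - mean) / (std + eps)


def token_advantages(adv_row, spans, seq_len):
    """Broadcast one rollout's per-subproblem advantages onto its answer spans.

    spans: one half-open (start, end) token range per subproblem (hypothetical
    format). Tokens outside every span keep advantage 0 in this sketch.
    """
    out = np.zeros(seq_len)
    for adv, (start, end) in zip(adv_row, spans):
        out[start:end] = adv
    return out


# Toy group: 4 rollouts, 3 subproblem positions, nobody solves the original problem.
rewards = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
])

adv = subproblem_advantages(rewards)
# Outcome-only baseline: normalize the final-answer reward alone.
outcome_only = subproblem_advantages(rewards[:, -1:])
# Token-level advantages for the first rollout, with made-up span boundaries.
tokens = token_advantages(adv[0], spans=[(2, 5), (7, 9), (11, 12)], seq_len=12)
```

In this toy group every rollout fails the original problem, so the outcome-only advantages are all zero and the group would contribute no gradient. The two earlier subproblem positions still have reward variance across the group, so their answer spans receive nonzero advantages, which is one way to read the claim that partial progress becomes a learning signal.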