Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

Abstract

Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in enhancing the reasoning capabilities of large language models (LLMs). However, existing RLVR methods often suffer from exploration inefficiency because the difficulty of the training data is mismatched with the model's capability: LLMs fail to discover viable reasoning paths when problems are overly difficult, and they learn little new capability when problems are too simple. In this work, we formalize the impact of problem difficulty by quantifying the relationship between loss descent speed and rollout accuracy. Building on this analysis, we propose SEELE, a novel supervision-aided RLVR framework that dynamically adjusts problem difficulty to stay within the high-efficiency region. SEELE augments each training sample by appending a hint (part of a full solution) after the original problem. Unlike previous hint-based approaches, SEELE deliberately and adaptively adjusts the hint length for each problem to achieve an optimal difficulty. To determine the optimal hint length, SEELE employs a multi-round rollout sampling strategy. In each round, it fits an item response theory model to the accuracy-hint pairs collected in preceding rounds to predict the required hint length for the next round. This instance-level, real-time difficulty adjustment aligns problem difficulty with the evolving model capability, thereby improving exploration efficiency. Experimental results show that SEELE outperforms Group Relative Policy Optimization (GRPO) and Supervised Fine-tuning (SFT) by +11.8 and +10.5 points, respectively, and surpasses the best previous supervision-aided approach by +3.6 points on average across six math reasoning benchmarks.
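The multi-round hint-length prediction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a two-parameter logistic curve as the item response theory model, hint length expressed as a fraction of the full solution in [0, 1], a target rollout accuracy of 0.5, and a fixed-step fallback when too few observations are available. None of these choices is stated in the abstract, and the names `fit_logistic` and `next_hint_ratio` are hypothetical.

```python
import math


def logit(p, eps=1e-3):
    """Log-odds of p, clipped away from 0 and 1 so the result stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_logistic(hints, accs):
    """Least-squares fit of logit(accuracy) = slope * hint + intercept.

    Assumed stand-in for the IRT model: accuracy = sigmoid(slope * hint + intercept).
    Returns None when all hint values are identical (slope is undetermined).
    """
    n = len(hints)
    ys = [logit(a) for a in accs]
    mean_h = sum(hints) / n
    mean_y = sum(ys) / n
    var_h = sum((h - mean_h) ** 2 for h in hints)
    if var_h == 0.0:
        return None
    cov = sum((h - mean_h) * (y - mean_y) for h, y in zip(hints, ys))
    slope = cov / var_h
    intercept = mean_y - slope * mean_h
    return slope, intercept


def next_hint_ratio(hints, accs, target=0.5, step=0.25):
    """Predict the hint ratio for the next rollout round.

    hints: hint ratios used in preceding rounds (assumed to lie in [0, 1]).
    accs:  rollout accuracies observed at those hint ratios.
    target: assumed target accuracy; the abstract does not give a value.
    """
    fit = fit_logistic(hints, accs) if len(hints) >= 2 else None
    if fit is None or fit[0] <= 0.0:
        # Assumed fallback: move the last hint ratio by a fixed step
        # toward the side that should bring accuracy closer to the target.
        delta = step if accs[-1] < target else -step
        return min(max(hints[-1] + delta, 0.0), 1.0)
    slope, intercept = fit
    # Invert the fitted curve at the target accuracy and clip to [0, 1].
    h = (logit(target) - intercept) / slope
    return min(max(h, 0.0), 1.0)
```

In a training loop of this shape, each round would sample rollouts at the predicted hint ratio, append the observed (hint, accuracy) pair, and refit, so the prediction tracks the model's capability as it changes.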
