LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification within a single LLM. However, this practice requires the LLM to generate solutions and self-verifications sequentially using two separate prompt templates, which significantly reduces efficiency. In this work, we show theoretically that the closed-form solution to the RL objective of self-verification reduces to a remarkably simple form: the true reasoning reward of a solution equals its last-token self-rewarding score, computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be used in both training and testing to improve model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution at the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
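
The score and the combined loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `mse_weight` parameter, and the toy numbers are hypothetical. The pre-calculated constant is taken as an input, since the abstract does not say how it is obtained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def last_token_self_reward(last_token_logits, token_id, ref_constant, kl_coef):
    """Score = kl_coef * (log pi(token | prompt, solution) - ref_constant).

    last_token_logits: next-token logits read at the solution's last token.
    token_id: index of the pre-specified token (hypothetical choice).
    ref_constant: the pre-calculated constant, supplied by the caller.
    kl_coef: the KL coefficient of the RL objective.
    """
    log_prob = log_softmax(last_token_logits)[token_id]
    return kl_coef * (log_prob - ref_constant)


def self_reward_mse(scores, verifier_rewards):
    """Mean squared error between self-rewarding scores and verifier rewards."""
    pairs = list(zip(scores, verifier_rewards))
    return sum((s - r) ** 2 for s, r in pairs) / len(pairs)


def laser_style_loss(rlvr_loss, scores, verifier_rewards, mse_weight=1.0):
    """RLVR loss plus the MSE alignment term.

    mse_weight is an assumed knob; the abstract only says the MSE loss is added.
    """
    return rlvr_loss + mse_weight * self_reward_mse(scores, verifier_rewards)


# Toy usage: a two-token vocabulary with equal logits, so log pi = log(0.5).
score = last_token_self_reward([0.0, 0.0], token_id=0,
                               ref_constant=math.log(0.25), kl_coef=1.0)
total = laser_style_loss(rlvr_loss=2.0, scores=[1.0, 0.0],
                         verifier_rewards=[1.0, 1.0])
```

Because the score is read from a single next-token distribution, it needs one extra forward step after the solution ends, which matches the cost claim in the abstract.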

