Self-rewarding correction for mathematical reasoning

Abstract

We study self-rewarding reasoning large language models (LLMs), which can simultaneously generate step-by-step reasoning and evaluate the correctness of their outputs at inference time, without external feedback. This integrated approach allows a single model to independently guide its reasoning process, offering computational advantages for model deployment. We particularly focus on the representative task of self-correction, where models autonomously detect errors in their responses, revise outputs, and decide when to terminate iterative refinement loops. To enable this, we propose a two-stage algorithmic framework for constructing self-rewarding reasoning models using only self-generated data. In the first stage, we employ sequential rejection sampling to synthesize long chain-of-thought trajectories that incorporate both self-rewarding and self-correction mechanisms. Fine-tuning models on these curated data allows them to learn the patterns of self-rewarding and self-correction. In the second stage, we further enhance the models' ability to assess response accuracy and refine outputs through reinforcement learning with rule-based signals. Experiments with Llama-3 and Qwen-2.5 demonstrate that our approach surpasses intrinsic self-correction capabilities and achieves performance comparable to systems that rely on external reward models.
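The self-correction loop described above (generate, self-evaluate, revise, decide when to stop) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Attempt`, `self_correct`, `toy_generate`, and `toy_evaluate` are hypothetical, and the toy callables stand in for what would be a single LLM playing both roles.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    answer: str
    self_verdict: bool  # the model's own judgment that the answer is correct


def self_correct(
    generate: Callable[[str, List[Attempt]], str],
    self_evaluate: Callable[[str, str], bool],
    prompt: str,
    max_rounds: int = 3,
) -> List[Attempt]:
    """Generate, self-evaluate, and revise until the model accepts its answer.

    Both callables are assumed to be backed by the same model; no external
    reward model or ground-truth label is consulted inside the loop.
    """
    history: List[Attempt] = []
    for _ in range(max_rounds):
        answer = generate(prompt, history)         # first attempt or a revision
        verdict = self_evaluate(prompt, answer)    # self-rewarding step
        history.append(Attempt(answer, verdict))
        if verdict:                                # model decides to terminate
            break
    return history


# Hypothetical stand-ins for a real model, for illustration only.
def toy_generate(prompt: str, history: List[Attempt]) -> str:
    # The first attempt is wrong; any revision produces the right answer.
    return "5" if not history else "4"


def toy_evaluate(prompt: str, answer: str) -> bool:
    # A real model would judge its own reasoning; this stub just checks "4".
    return answer == "4"


history = self_correct(toy_generate, toy_evaluate, "What is 2 + 2?")
for i, attempt in enumerate(history, start=1):
    print(i, attempt.answer, attempt.self_verdict)
```

In this toy run the first attempt is rejected by the self-evaluation step and the second is accepted, so the loop stops after two rounds. The `max_rounds` cap is an assumption added here to guarantee termination.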
