Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

Abstract

Reinforcement Learning (RL) has played a central role in the recent surge in the math abilities of LLMs by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely because of its heavy reliance on reference answers and its inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT), a supervised approach that enables LLMs to reflect on their failures and improve autonomously without external teachers. In online training, instead of discarding self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized by the same positive LLM that we aim to optimize on positive data, enabling direct policy optimization on all of the LLM's generations. We conduct experiments on 7B and 32B models on math reasoning tasks. The results consistently show that, by additionally leveraging negative feedback, NFT significantly improves over SL baselines such as Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms such as GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are in fact equivalent in strict on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
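
To make the idea of an implicit negative policy more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a single question with four candidate answers, a known sampling distribution `pi`, and a binary verifier. By the law of total probability, `pi = r * pi_pos + (1 - r) * pi_neg`, where `r` is the fraction of correct samples, so a negative policy can be written in terms of the positive model as `(pi - r * pos_model) / (1 - r)`. All function and variable names here are hypothetical, and the loss is a sequence-level simplification; a real system would work with LLM likelihoods rather than a lookup table.

```python
import math


def split_policy(pi, is_correct):
    """Split the sampling distribution pi using verifier labels.

    Returns (r, pi_pos, pi_neg), where r is the probability of a correct
    answer and pi_pos / pi_neg are pi conditioned on correct / incorrect.
    """
    r = sum(p for a, p in pi.items() if is_correct[a])
    pi_pos = {a: (p / r if is_correct[a] else 0.0) for a, p in pi.items()}
    pi_neg = {a: (0.0 if is_correct[a] else p / (1.0 - r)) for a, p in pi.items()}
    return r, pi_pos, pi_neg


def implicit_negative(pi, pos_model, r):
    """Negative policy implied by a positive model.

    Rearranges pi = r * pi_pos + (1 - r) * pi_neg for pi_neg.
    """
    return {a: (pi[a] - r * pos_model[a]) / (1.0 - r) for a in pi}


def nft_style_loss(pi, is_correct, pos_model, r):
    """Expected negative log-likelihood over all samples drawn from pi.

    Correct answers are scored by the positive model directly; incorrect
    answers are scored by the implicit negative policy, which depends on
    the same positive model. Assumes every scored probability is > 0.
    """
    neg_model = implicit_negative(pi, pos_model, r)
    loss = 0.0
    for a, p in pi.items():
        q = pos_model[a] if is_correct[a] else neg_model[a]
        loss -= p * math.log(q)
    return loss


# Toy data (hypothetical): answers A and B are correct, C and D are wrong.
pi = {"A": 0.4, "B": 0.1, "C": 0.3, "D": 0.2}
is_correct = {"A": True, "B": True, "C": False, "D": False}

r, pi_pos, pi_neg = split_policy(pi, is_correct)

# A positive model that wrongly leaks probability mass onto C and D.
leaky_model = {"A": 0.6, "B": 0.2, "C": 0.1, "D": 0.1}

print("loss at true positive policy:", nft_style_loss(pi, is_correct, pi_pos, r))
print("loss at leaky positive model:", nft_style_loss(pi, is_correct, leaky_model, r))
```

In this toy setting the negative samples contribute a training signal through the same positive model: a model that places mass on wrong answers receives a higher loss than the true positive distribution, even though no reference answers are used.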
