

Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

May 23, 2025
Authors: Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
cs.AI

Abstract

Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we target to optimize on positive data, enabling direct policy optimization on all LLMs' generations. We conduct experiments on 7B and 32B models in math reasoning tasks. Results consistently show that through the additional leverage of negative feedback, NFT significantly improves over SL baselines like Rejection sampling Fine-Tuning, matching or even surpassing leading RL algorithms like GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict-on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
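The abstract describes NFT only at a high level: positive (verified-correct) self-generated answers are trained on directly, while negative answers are modeled through an implicit negative policy that shares parameters with the positive policy. The sketch below is a minimal, purely illustrative PyTorch reading of that description; the function name `nft_style_loss`, its arguments, and in particular the construction `pi_neg = (pi_old - r * pi_theta) / (1 - r)` are assumptions made for illustration, not the paper's exact objective.

```python
import torch

def nft_style_loss(logp_theta, logp_old, is_positive, pos_rate, eps=1e-6):
    """Hypothetical sketch of a negative-aware supervised objective.

    logp_theta  : sequence log-probs of sampled answers under the current policy (requires grad)
    logp_old    : sequence log-probs under the frozen policy that generated the samples
    is_positive : 0/1 float mask from a binary verifier (1 = answer judged correct)
    pos_rate    : fraction of correct answers sampled for the same prompt

    Positive answers get a standard maximum-likelihood term. Negative answers are
    scored under an *implicit* negative policy built from the same parameters,
    assumed here to be pi_neg = (pi_old - r * pi_theta) / (1 - r), which is one
    way to read "parameterized with the same positive LLM".
    """
    # Standard supervised (maximum-likelihood) term on verified-correct answers.
    pos_loss = -(logp_theta * is_positive).sum() / is_positive.sum().clamp(min=1)

    # Log-likelihood of negatives under the assumed implicit negative policy.
    ratio = torch.exp(logp_theta - logp_old)                      # pi_theta / pi_old
    pi_neg_over_old = (1.0 - pos_rate * ratio) / (1.0 - pos_rate + eps)
    neg_logp = torch.log(pi_neg_over_old.clamp(min=eps)) + logp_old
    neg_mask = 1.0 - is_positive
    neg_loss = -(neg_logp * neg_mask).sum() / neg_mask.sum().clamp(min=1)

    return pos_loss + neg_loss

# Toy usage: four sampled answers to one prompt, two verified correct.
logp_theta = torch.tensor([-5.0, -7.0, -6.0, -8.0], requires_grad=True)
logp_old = torch.tensor([-5.2, -6.8, -6.1, -7.9])
is_pos = torch.tensor([1.0, 0.0, 1.0, 0.0])
loss = nft_style_loss(logp_theta, logp_old, is_pos, pos_rate=torch.tensor(0.5))
loss.backward()
```

Under this reading, the claimed equivalence with GRPO in the strict on-policy case corresponds to the point where logp_theta equals logp_old, so the negative term reduces to a reward-weighted likelihood; again, this is an interpretation of the abstract, not a derivation from the paper.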

