

Bridging Supervised Learning and Reinforcement Learning in Math Reasoning

May 23, 2025
Authors: Huayu Chen, Kaiwen Zheng, Qinsheng Zhang, Ganqu Cui, Yin Cui, Haotian Ye, Tsung-Yi Lin, Ming-Yu Liu, Jun Zhu, Haoxiang Wang
cs.AI

Abstract

Reinforcement Learning (RL) has played a central role in the recent surge of LLMs' math abilities by enabling self-improvement through binary verifier signals. In contrast, Supervised Learning (SL) is rarely considered for such verification-driven training, largely due to its heavy reliance on reference answers and inability to reflect on mistakes. In this work, we challenge the prevailing notion that self-improvement is exclusive to RL and propose Negative-aware Fine-Tuning (NFT) -- a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers. In online training, instead of throwing away self-generated negative answers, NFT constructs an implicit negative policy to model them. This implicit policy is parameterized with the same positive LLM we aim to optimize on positive data, enabling direct policy optimization on all of the LLM's generations. We conduct experiments on 7B and 32B models on math reasoning tasks. Results consistently show that, by additionally leveraging negative feedback, NFT significantly improves over SL baselines such as Rejection-sampling Fine-Tuning, matching or even surpassing leading RL algorithms such as GRPO and DAPO. Furthermore, we demonstrate that NFT and GRPO are actually equivalent in strict on-policy training, even though they originate from entirely different theoretical foundations. Our experiments and theoretical findings bridge the gap between SL and RL methods in binary-feedback learning systems.
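To make the mechanism described in the abstract more concrete, the following is a minimal sketch of what a negative-aware supervised objective could look like in PyTorch: verifier-approved answers receive an ordinary maximum-likelihood term, while incorrect answers are scored under an implicit negative policy expressed through the same positive model and the per-question success rate. The function name `nft_style_loss`, the tensor layout, and the specific formula for the implicit negative policy are illustrative assumptions, not the paper's exact per-token objective.

```python
import torch

def nft_style_loss(logp_theta, logp_old, is_positive, pos_rate, eps=1e-6):
    """Sketch of a negative-aware fine-tuning loss (illustrative, not the paper's exact objective).

    logp_theta : (N,) sequence log-probs of sampled answers under the current model
    logp_old   : (N,) sequence log-probs under the frozen sampling model
    is_positive: (N,) bool mask, True where the binary verifier marked the answer correct
    pos_rate   : (N,) per-question fraction of correct samples (r_q in [0, 1))
    """
    # Positive (verifier-approved) answers: plain maximum-likelihood term,
    # as in rejection-sampling fine-tuning.
    pos_loss = -logp_theta[is_positive]

    # Negative answers: score them under an implicit negative policy expressed through
    # the same positive model, here assumed to be
    #   pi_neg = (pi_old - r_q * pi_theta) / (1 - r_q).
    # The log pi_old part carries no gradient w.r.t. theta, so only the variable part is kept.
    ratio = torch.exp(logp_theta[~is_positive] - logp_old[~is_positive])  # pi_theta / pi_old
    r_q = pos_rate[~is_positive]
    implicit_neg = (1.0 - r_q * ratio).clamp_min(eps) / (1.0 - r_q).clamp_min(eps)
    neg_loss = -torch.log(implicit_neg)

    # Average the supervised positive term and the implicit negative term over all samples.
    return torch.cat([pos_loss, neg_loss]).mean()


# Toy usage: 8 sampled answers to one question, half of them marked correct by the verifier.
logp_theta = torch.randn(8, requires_grad=True)
logp_old = logp_theta.detach() + 0.1 * torch.randn(8)
is_positive = torch.tensor([True, False] * 4)
pos_rate = torch.full((8,), 0.5)
loss = nft_style_loss(logp_theta, logp_old, is_positive, pos_rate)
loss.backward()
```

The key design point the sketch tries to convey is that both terms differentiate only through the positive model's parameters, so negative samples are used without training a separate negative network.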
