NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning
May 21, 2025
Authors: Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, Yulan He
cs.AI
Abstract
Recent advances such as DeepSeek R1-Zero highlight the effectiveness of
incentive training, a reinforcement learning paradigm that computes rewards
solely based on the final answer part of a language model's output, thereby
encouraging the generation of intermediate reasoning steps. However, these
methods fundamentally rely on external verifiers, which limits their
applicability to domains like mathematics and coding where such verifiers are
readily available. Although reward models can serve as verifiers, they require
high-quality annotated data and are costly to train. In this work, we propose
NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning
framework that requires only standard supervised fine-tuning data with no need
for an external verifier. NOVER enables incentive training across a wide range
of text-to-text tasks and outperforms the model of the same size distilled from
large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the
flexibility of NOVER enables new possibilities for optimizing large language
models, such as inverse incentive training.
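The abstract does not spell out how a reward can be computed without an external verifier. The sketch below is only an illustration of one verifier-free reward signal consistent with the description (standard SFT data, no verifier): score a sampled reasoning trace by how likely the policy model itself finds the reference answer when conditioned on that reasoning. The model name, prompt template, and `<think>`/`<answer>` tags are assumptions for the example, not details taken from the paper.

```python
# Illustrative sketch of a verifier-free reward; not the paper's exact method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder policy model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def verifier_free_reward(prompt: str, reasoning: str, reference_answer: str) -> float:
    """Higher reward when the sampled reasoning makes the SFT reference answer more likely."""
    context = f"{prompt}\n<think>{reasoning}</think>\n<answer>"
    full = context + reference_answer + "</answer>"
    ctx_ids = tok(context, return_tensors="pt").input_ids
    full_ids = tok(full, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shifted token-level log-probs: position t predicts token t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the answer tokens, i.e. those conditioned on prompt + reasoning.
    answer_lp = token_lp[:, ctx_ids.shape[1] - 1:]
    # Reward = mean answer log-likelihood (negative per-token cross-entropy).
    return answer_lp.mean().item()
```

Such a reward could then replace the verifier score in a standard incentive-training loop (e.g., GRPO-style group-relative advantages over sampled reasoning traces); that wiring is likewise an assumption here, not something stated in the abstract.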