NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Abstract

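Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely from the final-answer part of a language model's output, thereby encouraging the model to generate intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains such as mathematics and coding where such verifiers are readily available. Reward models can serve as verifiers, but they require high-quality annotated data and are costly to train. In this work, we propose NOVER (NO-VERifier Reinforcement Learning), a general reinforcement learning framework that requires only standard supervised fine-tuning data and no external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms a same-size model distilled from large reasoning models such as DeepSeek R1 671B by 7.7%. Moreover, the flexibility of NOVER opens up new possibilities for optimizing large language models, such as inverse incentive training.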

