Reinforcing General Reasoning without Verifiers

Abstract

The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
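
As a rough illustration of what "directly maximize the probability of generating the reference answer" can mean, the toy sketch below treats the reference-answer likelihood itself as the reward for a small softmax policy over reasoning traces. It is not the released VeriFree code. All names (`expected_reference_prob`, `policy_gradient`, `train`) and the numbers are hypothetical, and the likelihood table is held fixed here for simplicity, whereas the abstract describes the policy and the implicit verifier as a single jointly trained model.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_reference_prob(logits, ref_prob_given_trace):
    # J = sum_z pi(z) * p(reference answer | z): the quantity being maximized.
    policy = softmax(logits)
    return sum(p * r for p, r in zip(policy, ref_prob_given_trace))

def policy_gradient(logits, ref_prob_given_trace):
    # Exact gradient of J with respect to the logits:
    # dJ/dlogit_j = pi_j * (r_j - J), i.e. reward minus an expected-reward baseline.
    policy = softmax(logits)
    baseline = sum(p * r for p, r in zip(policy, ref_prob_given_trace))
    return [p * (r - baseline) for p, r in zip(policy, ref_prob_given_trace)]

def train(logits, ref_prob_given_trace, steps=200, lr=1.0):
    # Plain gradient ascent on J; no verifier is consulted at any point.
    logits = list(logits)
    for _ in range(steps):
        grad = policy_gradient(logits, ref_prob_given_trace)
        logits = [v + lr * g for v, g in zip(logits, grad)]
    return logits

if __name__ == "__main__":
    # Assumed toy setup: three candidate reasoning traces and the probability
    # that each one leads the model to emit the reference answer.
    ref_probs = [0.1, 0.8, 0.3]
    start = [0.0, 0.0, 0.0]
    print(expected_reference_prob(start, ref_probs))
    print(expected_reference_prob(train(start, ref_probs), ref_probs))
```

In this toy setting the policy shifts its mass toward the trace under which the reference answer is most likely, so the expected reference-answer probability rises without any rule-based or model-based check of a sampled answer.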
