Expanding RL with Verifiable Rewards Across Diverse Domains

Abstract

Reinforcement learning (RL) with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which calls into question the need for large-scale annotation to train domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model with various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin across domains in free-form answer settings. These results also strengthen the case for RLVR's robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.
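The contrast between binary rewards and model-based soft scoring can be illustrated with a minimal sketch. It assumes, purely for illustration, that a verifier model exposes a probability that the response agrees with the reference answer; the names `Judgment`, `p_correct`, `binary_reward`, and `soft_reward`, and the 0.5 threshold, are hypothetical and are not taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Hypothetical verifier output for one (response, reference) pair."""
    p_correct: float  # assumed: verifier's probability that the response matches the reference


def binary_reward(judgment: Judgment, threshold: float = 0.5) -> float:
    """Hard 0/1 reward: 1.0 when the verifier leans toward 'correct', else 0.0."""
    return 1.0 if judgment.p_correct >= threshold else 0.0


def soft_reward(judgment: Judgment) -> float:
    """Soft reward: the verifier's confidence itself, clipped to [0, 1]."""
    return min(1.0, max(0.0, judgment.p_correct))


if __name__ == "__main__":
    for p in (0.95, 0.55, 0.10):
        j = Judgment(p_correct=p)
        print(p, binary_reward(j), soft_reward(j))
```

Under these assumptions, two responses that the verifier scores at 0.95 and 0.55 receive the same binary reward, while the soft reward preserves the difference in confidence. That graded signal is one plausible reading of why soft scoring could help with unstructured reference answers.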
