ChatPaper.ai


Expanding RL with Verifiable Rewards Across Diverse Domains

March 31, 2025
作者: Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
cs.AI

Abstract

Reinforcement learning (RL) with verifiable rewards (RLVR) has shown promising results in mathematical reasoning and coding tasks where well-structured reference answers are available. However, its applicability to broader domains remains underexplored. In this work, we study the extension of RLVR to more diverse domains such as medicine, chemistry, psychology, and economics. We observe high agreement in binary judgments across different large language models (LLMs) when objective reference answers exist, which challenges the necessity of large-scale annotation for training domain-specific reward models. To address the limitations of binary rewards when handling unstructured reference answers, we further incorporate model-based soft scoring into RLVR to improve its flexibility. Our experiments show that a distilled generative reward model can serve as an effective cross-domain verifier, providing reliable reward signals for RL without requiring domain-specific annotations. By fine-tuning a base 7B model using various RL algorithms against our reward model, we obtain policies that outperform state-of-the-art open-source aligned LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large margin, across domains in free-form answer settings. This also strengthens RLVR's robustness and scalability, highlighting its potential for real-world applications with noisy or weak labels.
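The abstract contrasts binary verification (exact match against a structured reference) with model-based soft scoring for unstructured, free-form answers. A minimal sketch of that distinction follows; the token-overlap F1 used here is a hypothetical stand-in for the paper's distilled generative reward model, chosen only to show why a graded score in [0, 1] credits partially correct free-form answers that a binary verifier rejects.

```python
# Sketch of binary vs. soft rewards for RLVR, assuming free-form text answers.
# The soft scorer below is a toy token-overlap F1, NOT the authors' reward model.

def binary_reward(answer: str, reference: str) -> float:
    """Strict verifier: 1.0 only on an exact (normalized) match."""
    return 1.0 if answer.strip().lower() == reference.strip().lower() else 0.0

def soft_reward(answer: str, reference: str) -> float:
    """Graded score in [0, 1]: token-level F1 overlap with the reference."""
    a, r = answer.lower().split(), reference.lower().split()
    if not a or not r:
        return 0.0
    common = sum(min(a.count(t), r.count(t)) for t in set(a))
    if common == 0:
        return 0.0
    precision, recall = common / len(a), common / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "aspirin inhibits cyclooxygenase enzymes"
partial = "aspirin inhibits the cyclooxygenase pathway"
print(binary_reward(partial, reference))  # 0.0 -- no partial credit
print(soft_reward(partial, reference))    # ~0.67 -- overlap is rewarded
```

In the paper's setting, the soft score would come from a distilled generative reward model judging the answer against the reference, so the signal stays informative even when references are unstructured or labels are noisy.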

