Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
October 8, 2025
Authors: Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu
cs.AI
Abstract
Post-training of large language models (LLMs) for reasoning increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle; many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward-model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
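To make the two mechanisms named in the abstract more concrete, here is a minimal sketch of one plausible reading of them: reward-model scores are min-max normalized inside each verifier-defined group and squashed into a narrow band around the binary reward so correct responses always outrank incorrect ones, and prompts are weighted by the variance of their verifier outcomes. The function names, the band width, and the specific weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of HERO-style stratified normalization and variance-aware
# weighting. All names and constants below are illustrative assumptions
# based only on the abstract, not the paper's actual equations.
import numpy as np

def stratified_hybrid_reward(verifier, rm_scores, band=0.5):
    """Combine binary verifier labels with continuous reward-model scores.

    verifier  : array of 0/1 correctness labels for sampled responses to one prompt
    rm_scores : array of reward-model scores for the same responses
    band      : width of the interval RM scores may occupy within each group
    """
    verifier = np.asarray(verifier, dtype=float)
    rm_scores = np.asarray(rm_scores, dtype=float)
    hybrid = verifier.copy()
    for label in (0.0, 1.0):
        mask = verifier == label
        if mask.sum() < 2:
            continue  # nothing to rank within a singleton or empty group
        group = rm_scores[mask]
        span = group.max() - group.min()
        # Min-max normalize within the verifier-defined group, then center the
        # result around the binary reward; with band=0.5, "correct" responses
        # land in [0.75, 1.25] and "incorrect" ones in [-0.25, 0.25], so
        # correctness ordering is preserved while quality is refined.
        normed = (group - group.min()) / span if span > 0 else np.zeros_like(group)
        hybrid[mask] = label + (normed - 0.5) * band
    return hybrid

def variance_aware_weight(verifier):
    """Upweight prompts whose sampled responses disagree, where dense RM
    feedback is most informative; 4p(1-p) is one illustrative choice that
    peaks at p = 0.5 and vanishes when all responses agree."""
    p = float(np.mean(verifier))
    return 4.0 * p * (1.0 - p)
```

A typical use would compute these per prompt over a group of sampled responses and feed the weighted hybrid rewards into the RL objective; the exact integration point (e.g., advantage computation) is not specified by the abstract and is left open here.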