Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense
October 8, 2025
Authors: Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu
cs.AI
Abstract
Post-training of large language models (LLMs) for reasoning increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle; many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms reward-model-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
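To make the two mechanisms named in the abstract more concrete, here is a minimal sketch of one plausible reading of them: reward-model scores are min-max normalized inside each verifier-defined group and squashed into a narrow band around the binary reward so correct responses always outrank incorrect ones, and prompts are weighted by the variance of their verifier outcomes. The function names, the band width, and the specific weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of HERO-style stratified normalization and variance-aware
# weighting. All names and constants below are illustrative assumptions
# based only on the abstract, not the paper's actual equations.
import numpy as np

def stratified_hybrid_reward(verifier, rm_scores, band=0.5):
    """Combine binary verifier labels with continuous reward-model scores.

    verifier  : array of 0/1 correctness labels for sampled responses to one prompt
    rm_scores : array of reward-model scores for the same responses
    band      : width of the interval RM scores may occupy within each group
    """
    verifier = np.asarray(verifier, dtype=float)
    rm_scores = np.asarray(rm_scores, dtype=float)
    hybrid = verifier.copy()
    for label in (0.0, 1.0):
        mask = verifier == label
        if mask.sum() < 2:
            continue  # nothing to rank within a singleton or empty group
        group = rm_scores[mask]
        span = group.max() - group.min()
        # Min-max normalize within the verifier-defined group, then center the
        # result around the binary reward; with band=0.5, "correct" responses
        # land in [0.75, 1.25] and "incorrect" ones in [-0.25, 0.25], so
        # correctness ordering is preserved while quality is refined.
        normed = (group - group.min()) / span if span > 0 else np.zeros_like(group)
        hybrid[mask] = label + (normed - 0.5) * band
    return hybrid

def variance_aware_weight(verifier):
    """Upweight prompts whose sampled responses disagree, where dense RM
    feedback is most informative; 4p(1-p) is one illustrative choice that
    peaks at p = 0.5 and vanishes when all responses agree."""
    p = float(np.mean(verifier))
    return 4.0 * p * (1.0 - p)
```

A typical use would compute these per prompt over a group of sampled responses and feed the weighted hybrid rewards into the RL objective; the exact integration point (e.g., advantage computation) is not specified by the abstract and is left open here.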