
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

October 8, 2025
Authors: Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu
cs.AI

Abstract

Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
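To make the two mechanisms named in the abstract concrete, the sketch below shows one way stratified normalization and variance-aware weighting could be combined for a group of sampled responses to a single prompt. It is a minimal illustration rather than the paper's implementation: the function name hybrid_reward, the band boundaries correct_band/incorrect_band, the min-max normalization, and the 1 + 4*variance weighting rule are all assumptions chosen for clarity.

```python
import numpy as np

def hybrid_reward(verifier_labels, rm_scores,
                  correct_band=(0.6, 1.0), incorrect_band=(0.0, 0.4)):
    """Hybrid reward for one prompt's group of sampled responses (illustrative).

    verifier_labels: 0/1 correctness checks from the deterministic verifier.
    rm_scores: continuous reward-model scores for the same responses.

    Stratified normalization: RM scores are min-max normalized *within* each
    verifier-defined group and mapped into disjoint bands, so every verified-
    correct response still outranks every incorrect one while the RM refines
    the ordering inside each group.
    """
    labels = np.asarray(verifier_labels, dtype=float)
    scores = np.asarray(rm_scores, dtype=float)
    rewards = np.zeros_like(scores)

    for label, (lo, hi) in ((1.0, correct_band), (0.0, incorrect_band)):
        mask = labels == label
        if not mask.any():
            continue
        group = scores[mask]
        span = group.max() - group.min()
        # Normalize within the stratum; a constant group collapses to the band midpoint.
        norm = (group - group.min()) / span if span > 0 else np.full_like(group, 0.5)
        rewards[mask] = lo + norm * (hi - lo)

    # Variance-aware weighting (assumed form): prompts with mixed verifier outcomes
    # (high 0/1 label variance) are where dense RM feedback matters most, so they
    # receive a larger weight in the policy update; all-correct or all-wrong prompts
    # keep weight 1.
    prompt_weight = 1.0 + 4.0 * labels.var()
    return rewards, prompt_weight
```

With verifier labels [1, 1, 0, 0] and RM scores [0.9, 0.4, 0.7, 0.1], for instance, this sketch returns rewards [1.0, 0.6, 0.4, 0.0] and a prompt weight of 2.0: the reward model refines quality within each verifier group, while the mixed verifier outcomes mark the prompt as one where that refinement matters most.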