
Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

October 8, 2025
Authors: Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu
cs.AI

Abstract

Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle--many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
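To make the two mechanisms named in the abstract concrete, the sketch below shows one way stratified normalization and variance-aware weighting could be combined for a group of sampled responses to a single prompt. It is a minimal illustration rather than the paper's implementation: the function name hybrid_reward, the band boundaries correct_band/incorrect_band, the min-max normalization, and the 1 + 4*variance weighting rule are all assumptions chosen for clarity.

```python
import numpy as np

def hybrid_reward(verifier_labels, rm_scores,
                  correct_band=(0.6, 1.0), incorrect_band=(0.0, 0.4)):
    """Hybrid reward for one prompt's group of sampled responses (illustrative).

    verifier_labels: 0/1 correctness checks from the deterministic verifier.
    rm_scores: continuous reward-model scores for the same responses.

    Stratified normalization: RM scores are min-max normalized *within* each
    verifier-defined group and mapped into disjoint bands, so every verified-
    correct response still outranks every incorrect one while the RM refines
    the ordering inside each group.
    """
    labels = np.asarray(verifier_labels, dtype=float)
    scores = np.asarray(rm_scores, dtype=float)
    rewards = np.zeros_like(scores)

    for label, (lo, hi) in ((1.0, correct_band), (0.0, incorrect_band)):
        mask = labels == label
        if not mask.any():
            continue
        group = scores[mask]
        span = group.max() - group.min()
        # Normalize within the stratum; a constant group collapses to the band midpoint.
        norm = (group - group.min()) / span if span > 0 else np.full_like(group, 0.5)
        rewards[mask] = lo + norm * (hi - lo)

    # Variance-aware weighting (assumed form): prompts with mixed verifier outcomes
    # (high 0/1 label variance) are where dense RM feedback matters most, so they
    # receive a larger weight in the policy update; all-correct or all-wrong prompts
    # keep weight 1.
    prompt_weight = 1.0 + 4.0 * labels.var()
    return rewards, prompt_weight
```

With verifier labels [1, 1, 0, 0] and RM scores [0.9, 0.4, 0.7, 0.1], for instance, this sketch returns rewards [1.0, 0.6, 0.4, 0.0] and a prompt weight of 2.0: the reward model refines quality within each verifier group, while the mixed verifier outcomes mark the prompt as one where that refinement matters most.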