
SPARK: Synergistic Policy And Reward Co-Evolving Framework

September 26, 2025
作者: Ziyu Liu, Yuhang Zang, Shengyuan Ding, Yuhang Cao, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
cs.AI

Abstract

Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly use Reinforcement Learning (RL) for post-pretraining, such as RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF incurs high costs and potential reward-policy mismatch due to its reliance on human preferences, while RLVR still wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this valuable information to simultaneously train the model itself as a generative reward model. This auxiliary training uses a mix of objectives, such as pointwise reward scoring, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. Our process eliminates the need for a separate reward model and costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. Our unified framework supports test-time scaling via self-reflection without external reward models and their associated costs. We show that SPARK achieves significant performance gains on multiple LLMs and LVLMs across reasoning, reward-model, and general benchmarks. For example, SPARK-VL-7B achieves an average gain of 9.7% on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks over the baselines, demonstrating robustness and broad generalization.
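
The abstract describes recycling RLVR rollouts and their verifiable correctness signals into auxiliary reward-model objectives (pointwise scoring, pairwise comparison, and reflection-conditioned evaluation). The sketch below is a minimal, hypothetical illustration of that recycling step only, not the authors' implementation: the data structures, function names, and prompt templates (Rollout, recycle_rollouts, the judging questions) are assumptions made for illustration.

```python
# Minimal sketch (assumptions, not the SPARK codebase): turn one prompt's RLVR
# rollouts plus their correctness labels into text examples for training the
# same model as a generative reward model.

from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List


@dataclass
class Rollout:
    prompt: str
    response: str
    correct: bool  # verifiable-reward signal, e.g. answer matching


def recycle_rollouts(rollouts: List[Rollout]) -> List[Dict[str, str]]:
    """Build reward-model supervision from rollouts instead of discarding them."""
    examples = []

    # 1. Pointwise reward scoring: judge a single response as correct or not.
    for r in rollouts:
        examples.append({
            "task": "pointwise",
            "input": f"Question: {r.prompt}\nAnswer: {r.response}\nIs this answer correct?",
            "target": "yes" if r.correct else "no",
        })

    # 2. Pairwise comparison: only pairs with differing correctness labels are informative.
    for a, b in combinations(rollouts, 2):
        if a.correct != b.correct:
            examples.append({
                "task": "pairwise",
                "input": (f"Question: {a.prompt}\n"
                          f"Answer A: {a.response}\nAnswer B: {b.response}\n"
                          f"Which answer is better?"),
                "target": "A" if a.correct else "B",
            })

    # 3. Reflection-conditioned evaluation: ask the model to reassess a response
    #    after an additional reflection step (the reflection text is left implicit here).
    for r in rollouts:
        examples.append({
            "task": "reflection",
            "input": (f"Question: {r.prompt}\nAnswer: {r.response}\n"
                      f"Reflect on the answer, then judge its correctness."),
            "target": "correct" if r.correct else "incorrect",
        })

    return examples


if __name__ == "__main__":
    rollouts = [
        Rollout("What is 2 + 3?", "2 + 3 = 5", correct=True),
        Rollout("What is 2 + 3?", "2 + 3 = 6", correct=False),
    ]
    for ex in recycle_rollouts(rollouts):
        print(ex["task"], "->", ex["target"])
```

In the co-evolving loop the abstract outlines, examples like these would be mixed into the same on-policy update as the RLVR objective, so that better reward judgments and better rollouts reinforce each other without a separate reward model.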