SPARK: Synergistic Policy And Reward Co-Evolving Framework

Abstract

Recent Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) increasingly rely on Reinforcement Learning (RL) after pretraining, for example RL with Verifiable Rewards (RLVR) for objective tasks and RL from Human Feedback (RLHF) for subjective tasks. However, RLHF depends on human preferences, which makes it costly and prone to reward-policy mismatch, while RLVR wastes supervision by discarding rollouts and correctness signals after each update. To address these challenges, we introduce the Synergistic Policy And Reward Co-Evolving Framework (SPARK), an efficient, on-policy, and stable method that builds on RLVR. Instead of discarding rollouts and correctness data, SPARK recycles this information to simultaneously train the model itself as a generative reward model. This auxiliary training mixes several objectives, such as pointwise reward scoring, pairwise comparison, and evaluation conditioned on further-reflection responses, to teach the model to evaluate and improve its own responses. The process eliminates the need for a separate reward model and for costly human preference data. SPARK creates a positive co-evolving feedback loop: improved reward accuracy yields better policy gradients, which in turn produce higher-quality rollouts that further refine the reward model. The unified framework also supports test-time scaling via self-reflection, without external reward models and their associated costs. We show that SPARK achieves significant performance gains across multiple LLMs and LVLMs on reasoning, reward-model, and general benchmarks. For example, SPARK-VL-7B achieves an average gain over the baselines of 9.7% on 7 reasoning benchmarks, 12.1% on 2 reward benchmarks, and 1.5% on 8 general benchmarks, demonstrating robustness and broad generalization.
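The recycling step described in the abstract can be illustrated with a minimal sketch. It assumes that each rollout arrives with a boolean correctness signal from a verifier, and it turns a group of rollouts for one question into pointwise and pairwise judging examples. The `Rollout` class, the function names, the prompt wording, and the "yes"/"no" and "A"/"B" targets are all hypothetical choices made for illustration, not the paper's actual data format, and the reflection-conditioned objective is left out.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Rollout:
    """One sampled response plus the verifier's correctness signal (hypothetical type)."""
    response: str
    correct: bool


def pointwise_examples(question, rollouts):
    """Build one judging example per rollout: is this response correct?"""
    return [
        {
            "prompt": f"Question: {question}\nResponse: {r.response}\n"
                      "Is the response correct?",
            "target": "yes" if r.correct else "no",
        }
        for r in rollouts
    ]


def pairwise_examples(question, rollouts):
    """Pair every correct rollout with every incorrect one and ask which is better."""
    good = [r for r in rollouts if r.correct]
    bad = [r for r in rollouts if not r.correct]
    examples = []
    for i, (g, b) in enumerate(product(good, bad)):
        # Alternate which slot holds the correct response so the judge
        # cannot learn a fixed position (an assumption of this sketch).
        first, second, target = (g, b, "A") if i % 2 == 0 else (b, g, "B")
        examples.append({
            "prompt": f"Question: {question}\nA: {first.response}\n"
                      f"B: {second.response}\nWhich response is better?",
            "target": target,
        })
    return examples


# Toy group of rollouts for a single question.
rollouts = [Rollout("4", True), Rollout("5", False), Rollout("3", False)]
pointwise = pointwise_examples("What is 2 + 2?", rollouts)
pairwise = pairwise_examples("What is 2 + 2?", rollouts)
```

In a setup like this, the resulting examples would be mixed into the same training run as the policy update, so that the model that generates responses also learns to judge them. How the two losses are weighted is not specified in the abstract.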
