VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning
October 12, 2025
Authors: Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu
cs.AI
Abstract
Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) a Cold Start on curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) Rejection-Sampling Fine-Tuning, in which we select samples whose per-dimension and overall judgments are all correct and fine-tune on these high-quality traces to further enhance reasoning; and (iii) Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
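To make the thinking-with-image mechanism concrete, the sketch below illustrates how a reward model could interleave text reasoning with on-demand frame retrieval under a bounded visual memory window. It is a minimal illustration of the idea, not the paper's implementation; all names (`VisualMemory`, `select_frame`, `model.generate`, the operation tags, the seeding policy) are hypothetical.

```python
# Minimal sketch of a thinking-with-image reward loop, assuming a chat-style
# multimodal model API. The operation format and memory policy are illustrative;
# VR-Thinker's actual operation set may differ.
import re
from collections import deque

class VisualMemory:
    """Keeps only the most recent `window` frames in visual context."""
    def __init__(self, window: int = 8):
        self.frames = deque(maxlen=window)

    def add(self, frame):
        self.frames.append(frame)

    def as_context(self):
        return list(self.frames)

def judge_video(model, prompt: str, video_frames: list, window: int = 8, max_steps: int = 6):
    """Iteratively reason about a video, fetching frames on demand
    instead of packing every frame into the initial prompt."""
    memory = VisualMemory(window)
    # Seed the window with a few uniformly spaced frames (hypothetical policy).
    for i in range(0, len(video_frames), max(1, len(video_frames) // window)):
        memory.add(video_frames[i])

    transcript = [f"Task: {prompt}", f"The video has {len(video_frames)} frames."]
    for _ in range(max_steps):
        # The model sees the text transcript plus only the frames currently in the window.
        reply = model.generate(text="\n".join(transcript), images=memory.as_context())
        transcript.append(reply)

        # Hypothetical operation format: "<select_frame>17</select_frame>"
        match = re.search(r"<select_frame>(\d+)</select_frame>", reply)
        if match:
            idx = min(int(match.group(1)), len(video_frames) - 1)
            memory.add(video_frames[idx])  # acquire new visual evidence
            transcript.append(f"[frame {idx} added to visual memory]")
        elif "<final_judgment>" in reply:
            return reply  # e.g. per-dimension scores plus an overall preference
    return transcript[-1]
```

The key property is that only the frames currently in the window occupy visual context, so the model can inspect many frames over the course of a reasoning trace without exceeding its context budget.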
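The second pipeline stage keeps only reasoning traces whose per-dimension and overall judgments are all correct before fine-tuning on them. A minimal sketch of that filtering step, with illustrative field names (`per_dimension`, `overall`) that are not taken from the paper:

```python
# Sketch of the rejection-sampling filter: keep a trace only if every
# per-dimension judgment AND the overall judgment match the ground truth.
def keep_trace(trace: dict, labels: dict) -> bool:
    dims_correct = all(
        trace["per_dimension"][d] == labels["per_dimension"][d]
        for d in labels["per_dimension"]
    )
    return dims_correct and trace["overall"] == labels["overall"]

def build_rft_dataset(sampled_traces, ground_truth):
    """Sample several traces per prompt, then fine-tune only on the fully correct ones."""
    return [t for t, y in zip(sampled_traces, ground_truth) if keep_trace(t, y)]
```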
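The final stage uses Group Relative Policy Optimization (GRPO), which scores each sampled trace relative to the other traces drawn for the same prompt. The snippet below shows the standard group-relative advantage computation; the reward values are illustrative, and VR-Thinker's exact reward design is not specified in the abstract.

```python
# Sketch of GRPO's group-relative advantage: each rollout for the same prompt
# is normalized against the mean and std of its group's rewards.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 rollouts for one prompt; reward 1.0 if the final judgment was correct.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> roughly [1.0, -1.0, 1.0, -1.0]
```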