

Video Models Can Reason with Verifiable Rewards

May 14, 2026
Authors: Tinghui Zhu, Sheng Zhang, James Y. Huang, Selena Song, Xiaofei Wen, Yuankai Li, Hoifung Poon, Muhao Chen
cs.AI

Abstract

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models on these verifiable reasoning benchmarks and out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable rule-consistent visual reasoning.
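To make the three ingredients named in the abstract more concrete, here is a minimal sketch of how a rule-based dense decomposed reward, a GRPO-style group-relative advantage, and an early-step mask might fit together for a Maze-like task. This is an illustration under stated assumptions, not the authors' implementation: the function names (`maze_reward`, `group_advantages`, `early_step_mask`), the reward weights (0.2 / 0.3 / 0.5), the grid encoding (0 = free, 1 = wall), and the 0.5 early-step fraction are all hypothetical, and the sketch assumes the generated video has already been decoded into a list of grid cells.

```python
from statistics import mean, pstdev


def maze_reward(grid, path, start, goal):
    """Dense decomposed reward for one decoded trajectory.

    Hypothetical scheme: partial credit for a correct start and for each
    legal move, plus a success bonus only when every rule is satisfied.
    """
    if not path:
        return 0.0
    starts_ok = 1.0 if path[0] == start else 0.0
    legal = 0
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        in_bounds = 0 <= r1 < len(grid) and 0 <= c1 < len(grid[0])
        adjacent = abs(r0 - r1) + abs(c0 - c1) == 1
        # A move is legal if it lands on an in-bounds, adjacent, free cell.
        if in_bounds and adjacent and grid[r1][c1] == 0:
            legal += 1
    legal_frac = legal / max(len(path) - 1, 1)
    solved = starts_ok == 1.0 and legal_frac == 1.0 and path[-1] == goal
    # Illustrative weights; the source does not specify any.
    return 0.2 * starts_ok + 0.3 * legal_frac + 0.5 * (1.0 if solved else 0.0)


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward within its sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, fraction=0.5):
    """Mark which denoising steps receive policy updates (early ones only).

    The fraction is an arbitrary illustrative value, not taken from the source.
    """
    cutoff = int(num_steps * fraction)
    return [i < cutoff for i in range(num_steps)]


# Usage: two sampled trajectories for the same 2x2 maze.
grid = [[0, 0],
        [1, 0]]
good = [(0, 0), (0, 1), (1, 1)]   # legal path that reaches the goal
bad = [(0, 0), (1, 0), (1, 1)]    # walks through a wall
rewards = [maze_reward(grid, p, (0, 0), (1, 1)) for p in (good, bad)]
advantages = group_advantages(rewards)
mask = early_step_mask(10)
```

The failed trajectory still earns partial credit for its correct start and its one legal move, so a group in which no sample fully succeeds can still have non-identical rewards and therefore non-zero advantages. That is one plausible reading of why dense decomposed rewards would matter most in low-success-rate settings.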