Video Models Can Reason with Verifiable Rewards

Abstract

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models, both on these verifiable reasoning benchmarks and on out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable, rule-consistent visual reasoning.
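To make the three ingredients named above more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a maze path has already been decoded from a generated video into a list of grid cells (the decoding step is not shown). The reward components, their weights, the function names, and the 0.5 early-step fraction are all illustrative assumptions; the abstract does not specify them. The advantage function uses the standard group-relative normalization from GRPO (reward minus group mean, divided by group standard deviation).

```python
import math
from statistics import mean, pstdev

# Hypothetical maze encoding: 0 = free cell, 1 = wall. Cells are (row, col).
# Assumed weights for the decomposed reward; not taken from the paper.
WEIGHTS = {"start": 0.1, "valid": 0.2, "progress": 0.2, "success": 0.5}


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def decomposed_maze_reward(grid, path, start, goal):
    """Rule-based reward with partial credit; every component lies in [0, 1]."""
    parts = {"start": 0.0, "valid": 0.0, "progress": 0.0, "success": 0.0}
    if not path:
        return parts
    rows, cols = len(grid), len(grid[0])

    def is_free(cell):
        r, c = cell
        return 0 <= r < rows and 0 <= c < cols and grid[r][c] == 0

    # A step is legal if it moves to an adjacent cell that is free.
    steps = list(zip(path, path[1:]))
    legal = sum(1 for a, b in steps if manhattan(a, b) == 1 and is_free(b))

    parts["start"] = 1.0 if path[0] == start else 0.0
    parts["valid"] = legal / max(1, len(steps))
    # Progress: how much of the start-to-goal distance the endpoint has closed.
    total = max(1, manhattan(start, goal))
    parts["progress"] = min(1.0, max(0.0, 1.0 - manhattan(path[-1], goal) / total))
    solved = path[0] == start and path[-1] == goal and legal == len(steps)
    parts["success"] = 1.0 if solved else 0.0
    return parts


def total_reward(parts):
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, fraction=0.5):
    """True for the denoising steps that receive policy-gradient updates.

    The fraction is an assumed hyperparameter for illustration only.
    """
    cutoff = math.ceil(fraction * num_steps)
    return [i < cutoff for i in range(num_steps)]


if __name__ == "__main__":
    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]
    start, goal = (0, 0), (2, 2)
    group = [
        [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)],  # solves the maze
        [(0, 0), (0, 1), (1, 1), (2, 1)],          # cuts through a wall
        [(0, 0)],                                  # never moves
    ]
    rewards = [total_reward(decomposed_maze_reward(grid, p, start, goal)) for p in group]
    print(rewards)
    print(group_advantages(rewards))
    print(early_step_mask(10))
```

The design point the sketch tries to illustrate is that a binary success signal alone would give the second and third samples the same reward of zero, whereas the decomposed components still rank them, which is consistent with the abstract's observation that dense rewards matter most when success is rare.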