Video Models Can Reason with Verifiable Rewards

Abstract

Video diffusion models have made rapid progress in perceptual realism and temporal coherence, but they remain primarily optimized for plausible generation rather than verifiable reasoning. This limitation is especially pronounced in tasks where generated videos must satisfy explicit spatial, temporal, or logical constraints. Inspired by the role of reinforcement learning with verifiable rewards (RLVR) in reasoning-oriented language models, we introduce VideoRLVR, a practical recipe for optimizing video diffusion models with rule-based feedback. VideoRLVR formulates video reasoning as the generation of verifiable visual trajectories and consists of three components: an SDE-GRPO optimization backbone, dense decomposed rewards, and an Early-Step Focus strategy for efficient training. The Early-Step Focus strategy restricts policy optimization to the early denoising phase, reducing training latency by about 40% while preserving performance. We evaluate VideoRLVR on Maze, FlowFree, and Sokoban, three procedurally generated domains with objective success criteria. Across these tasks, VideoRLVR consistently improves over supervised fine-tuning baselines, with dense decomposed rewards proving especially important in low-success-rate settings. Our RL-optimized model also outperforms the evaluated proprietary and open-source video generation models, both on these verifiable reasoning benchmarks and on out-of-domain benchmarks. These results suggest that verifiable RL can move video models beyond perceptual imitation toward more reliable, rule-consistent visual reasoning.
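To make the three ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that a generated Maze video has already been decoded into a list of grid cells visited by the agent. The function names (`maze_reward`, `group_relative_advantages`, `early_step_mask`), the three reward terms, their weights, and the `fraction` cutoff are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from statistics import mean, pstdev

# ASSUMPTION: a maze is a grid of ints (0 = free cell, 1 = wall) and a
# generated video has already been decoded into a list of (row, col) cells.


def maze_reward(grid, path, start, goal, weights=(0.25, 0.25, 0.5)):
    """Rule-based reward split into dense terms plus a sparse success term.

    The three terms and their weights are illustrative, not from the paper.
    """
    if not path:
        return 0.0
    rows, cols = len(grid), len(grid[0])

    # Term 1: fraction of visited cells that are in bounds and not walls.
    legal = [0 <= r < rows and 0 <= c < cols and grid[r][c] == 0
             for r, c in path]
    validity = sum(legal) / len(path)

    # Term 2: fraction of moves that are single 4-neighbour steps.
    moves = list(zip(path, path[1:]))
    if moves:
        unit = [abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1 for a, b in moves]
        continuity = sum(unit) / len(moves)
    else:
        continuity = 1.0

    # Term 3: binary success, only when every rule holds end to end.
    success = float(validity == 1.0 and continuity == 1.0
                    and path[0] == start and path[-1] == goal)

    w_valid, w_cont, w_succ = weights
    return w_valid * validity + w_cont * continuity + w_succ * success


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def early_step_mask(num_steps, fraction=0.5):
    """Mark which denoising steps receive policy-gradient updates.

    ASSUMPTION: 'early' means the first `fraction` of the sampling steps.
    """
    cutoff = max(1, int(num_steps * fraction))
    return [i < cutoff for i in range(num_steps)]


if __name__ == "__main__":
    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]
    start, goal = (0, 0), (2, 2)
    good = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]
    bad = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]  # walks through a wall
    rewards = [maze_reward(grid, p, start, goal) for p in (good, bad)]
    print(rewards)
    print(group_relative_advantages(rewards))
    print(early_step_mask(10, fraction=0.5))
```

In this sketch the failed rollout still earns partial credit for its legal and continuous moves, which is the intuition behind dense rewards in low-success-rate settings: when almost no sample in a group succeeds, a purely binary reward gives every sample the same score and the group-relative advantages collapse to zero. The mask only illustrates where updates would be applied; it does not by itself reproduce the reported latency reduction.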