World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
April 27, 2026
Authors: Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang
cs.AI
Abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
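The core optimization described above — scoring sampled generations with external reward models and reinforcing the better ones — can be illustrated with a minimal sketch of a GRPO-style, critic-free advantage computation. All names here (`combined_reward`, the score values, the blending weight `w`) are hypothetical illustrations, not the paper's actual implementation; the only technique shown is group-relative reward normalization, which GRPO-family methods use in place of a learned value function.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sample's reward by the
    group mean and std, so no learned value critic is needed."""
    r = np.asarray(rewards, dtype=np.float64)
    std = r.std()
    if std < 1e-8:            # all rewards equal: no learning signal
        return np.zeros_like(r)
    return (r - r.mean()) / std

def combined_reward(score_3d, score_vlm, w=0.5):
    """Hypothetical blend of a 3D-consistency score (e.g. from a
    pre-trained 3D foundation model) and a VLM quality score."""
    return w * score_3d + (1.0 - w) * score_vlm

# Example: four videos sampled for one prompt, each scored by both
# reward sources; the resulting advantages weight the policy update.
rewards = [combined_reward(s3, sv) for s3, sv in
           [(0.9, 0.8), (0.4, 0.7), (0.6, 0.5), (0.2, 0.3)]]
adv = group_relative_advantages(rewards)
```

In this setup the sample with the highest blended reward receives the largest positive advantage, so the flow-matching policy is pushed toward generations that the 3D and VLM reward models jointly prefer, without any architectural change to the video model itself.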