World-R1: Reinforcing 3D Constraints for Text-to-Video Generation
April 27, 2026
Authors: Weijie Wang, Xiaoxuan He, Youping Gu, Yifan Yang, Zeyu Zhang, Yefei He, Yanbo Ding, Xirui Hu, Donny Y. Chen, Zhiyuan He, Yuqing Yang, Bohan Zhuang
cs.AI
Abstract
Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. While existing methods attempt to inject 3D priors via architectural modifications, they often incur high computational costs and limit scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized pure text dataset tailored for world simulation. Utilizing Flow-GRPO, we optimize the model using feedback from pre-trained 3D foundation models and vision-language models to enforce structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations reveal that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
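The abstract describes a GRPO-style objective driven by two reward sources (a 3D foundation model for geometric consistency and a vision-language model for scene quality), combined under a periodic decoupled schedule. The sketch below illustrates that idea only in outline; the function names, weights, and period are hypothetical and not taken from the paper.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize rewards within a group of
    samples drawn for the same prompt, as in GRPO-style optimization."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def combined_reward(geo_reward, vlm_reward, step, period=100):
    """Hypothetical periodic decoupled weighting: alternate emphasis
    between rigid geometric consistency (3D-model reward) and dynamic
    scene fluidity (VLM reward) every `period` training steps."""
    if (step // period) % 2 == 0:
        w_geo, w_vlm = 0.8, 0.2   # geometry-focused phase
    else:
        w_geo, w_vlm = 0.2, 0.8   # fluidity-focused phase
    return w_geo * geo_reward + w_vlm * vlm_reward

# Example: a group of 4 sampled videos for one text prompt.
geo = np.array([0.7, 0.5, 0.9, 0.6])   # scores from a 3D foundation model
vlm = np.array([0.6, 0.8, 0.5, 0.7])   # scores from a vision-language model
r = combined_reward(geo, vlm, step=0)
adv = grpo_advantages(r)               # used to weight the policy update
```

The advantages are zero-mean within the group, so only relative ranking among the sampled videos drives the update, which is the core property GRPO-style methods rely on.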