World-R1: Reinforcing 3D Constraints for Text-to-Video Generation

Abstract

Recent video foundation models demonstrate impressive visual synthesis but frequently suffer from geometric inconsistencies. Existing methods attempt to inject 3D priors through architectural modifications, which often incurs high computational cost and limits scalability. We propose World-R1, a framework that aligns video generation with 3D constraints through reinforcement learning. To facilitate this alignment, we introduce a specialized text-only dataset tailored for world simulation. Using Flow-GRPO, we optimize the model with feedback from pre-trained 3D foundation models and vision-language models, enforcing structural coherence without altering the underlying architecture. We further employ a periodic decoupled training strategy to balance rigid geometric consistency with dynamic scene fluidity. Extensive evaluations show that our approach significantly enhances 3D consistency while preserving the original visual quality of the foundation model, effectively bridging the gap between video generation and scalable world simulation.
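
As a rough illustration of the reward-driven optimization described in the abstract, the following minimal sketch shows the group-relative advantage normalization that GRPO-style methods generally rely on: several videos are sampled for one prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The abstract does not specify how the two feedback sources are combined, so the weighted sum in `combined_reward`, its weights, and the example scores are all hypothetical placeholders, not the authors' reward design. The periodic decoupled training strategy is not modeled here.

```python
from statistics import mean, pstdev


def combined_reward(geometry_score, vlm_score, w_geometry=0.5, w_vlm=0.5):
    """Hypothetical scalar reward: a weighted sum of two feedback signals.

    geometry_score stands in for feedback from a 3D foundation model and
    vlm_score for feedback from a vision-language model. The weights are
    illustrative assumptions, not values from the paper.
    """
    return w_geometry * geometry_score + w_vlm * vlm_score


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat the group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for four videos sampled from a single prompt.
geometry_scores = [0.9, 0.4, 0.7, 0.2]
vlm_scores = [0.6, 0.8, 0.5, 0.3]

rewards = [combined_reward(g, v) for g, v in zip(geometry_scores, vlm_scores)]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Because the advantages are centered on the group mean, only the relative ranking of samples within a group drives the update, which is why this style of optimization needs no separately learned value function.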
