EgoForge: Goal-Directed Egocentric World Simulator

Abstract

Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging because of rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision such as camera trajectories, long video prefixes, or synchronized multi-camera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show that EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and that it performs robustly in real-world smart-glasses experiments.
