

EgoForge: Goal-Directed Egocentric World Simulator

March 20, 2026
Authors: Yifan Shen, Jiateng Liu, Xinzhuo Li, Yuanzhe Liu, Bingxuan Li, Houze Yang, Wenqi Jia, Yijiang Li, Tianjiao Yu, James Matthew Rehg, Xu Cao, Ismini Lourentzou
cs.AI

Abstract

Generative world models have shown promise for simulating dynamic environments, yet egocentric video generation remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision such as camera trajectories, long video prefixes, or synchronized multi-camera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement method that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show that EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, as well as robust performance in real-world smart-glasses experiments.
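The abstract describes VideoDiffusionNFT as a trajectory-level, reward-guided refinement interleaved with diffusion sampling over four objectives. The sketch below is a minimal, hypothetical illustration of that general idea: at each denoising step, the sample is nudged along the gradient of a weighted combined reward. Every component here (the toy denoiser, the four placeholder reward terms, the weights, and the `guidance` scale) is an assumption for illustration, not the paper's actual implementation.

```python
# Minimal sketch of trajectory-level reward-guided diffusion sampling.
# All reward terms and the denoiser are toy placeholders, NOT the
# components used by EgoForge / VideoDiffusionNFT.
import torch

def goal_completion_reward(x):
    # Placeholder: a real system would use a learned goal-completion scorer.
    return -x.pow(2).mean()

def temporal_causality_reward(x):
    # Placeholder: penalizes abrupt frame-to-frame changes.
    return -(x[:, 1:] - x[:, :-1]).pow(2).mean()

def scene_consistency_reward(x):
    # Placeholder: keeps frames close to the clip's mean appearance.
    return -(x - x.mean(dim=1, keepdim=True)).pow(2).mean()

def perceptual_fidelity_reward(x):
    # Placeholder: a real system would use a perceptual quality model.
    return -x.abs().mean()

def combined_reward(x, w=(1.0, 1.0, 0.5, 0.5)):
    # Weighted sum over the four objectives named in the abstract.
    return (w[0] * goal_completion_reward(x)
            + w[1] * temporal_causality_reward(x)
            + w[2] * scene_consistency_reward(x)
            + w[3] * perceptual_fidelity_reward(x))

@torch.no_grad()
def toy_denoise_step(x_t, t, num_steps):
    # Toy denoiser that shrinks noise linearly; a real diffusion model
    # would predict noise (or x0) with a learned video backbone.
    return x_t * (1.0 - 1.0 / (num_steps - t + 1))

def reward_guided_sampling(shape=(1, 8, 3, 16, 16), num_steps=20, guidance=0.1):
    # shape = (batch, frames, channels, height, width)
    x = torch.randn(shape)
    for t in range(num_steps):
        x = toy_denoise_step(x, t, num_steps)
        # Gradient ascent on the trajectory-level reward at this step.
        x = x.detach().requires_grad_(True)
        r = combined_reward(x)
        (grad,) = torch.autograd.grad(r, x)
        x = (x + guidance * grad).detach()
    return x

if __name__ == "__main__":
    video = reward_guided_sampling()
    print("rollout shape:", tuple(video.shape),
          "reward:", combined_reward(video).item())
```

One design point this sketch highlights: because the reward is evaluated on the whole rollout rather than per frame, objectives like temporal causality and scene consistency can shape the entire trajectory at once, which per-frame guidance cannot do.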