EgoX: Egocentric Video Generation from a Single Exocentric Video
December 9, 2025
作者: Taewoong Kang, Kinam Kim, Dohyeon Kim, Minho Park, Junha Hyung, Jaegul Choo
cs.AI
Abstract
Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding, but it remains highly challenging due to extreme camera pose variations and minimal view overlap. The task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To this end, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width-wise and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation and demonstrates strong scalability and robustness on unseen, in-the-wild videos.
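To make the two conditioning ideas named in the abstract concrete, the sketch below illustrates width-wise and channel-wise concatenation of exocentric and egocentric signals, and a self-attention variant biased by a geometric relevance map. This is a minimal PyTorch sketch under assumed tensor shapes; the function names (`unified_conditioning`, `geometry_guided_attention`) and the `relevance` input (e.g., derived from cross-view geometry) are hypothetical illustrations, not the paper's actual architecture.

```python
# Hypothetical sketch of the conditioning strategy described in the abstract.
# Shapes, names, and the relevance-bias heuristic are illustrative assumptions.
import torch
import torch.nn.functional as F


def unified_conditioning(ego_latent, exo_latent, ego_prior):
    """Fuse conditioning signals with the target latent.

    ego_latent: (B, C, T, H, W)  noisy egocentric latent being denoised
    exo_latent: (B, C, T, H, W)  encoded exocentric video (condition)
    ego_prior:  (B, C, T, H, W)  egocentric prior signal
    """
    # Width-wise concatenation: place the exocentric condition beside the
    # target along the width axis so self-attention can reach it directly.
    x = torch.cat([ego_latent, exo_latent], dim=-1)        # (B, C, T, H, 2W)
    # Channel-wise concatenation: stack the egocentric prior as extra input
    # channels, right-padded along width to match the doubled width.
    prior = F.pad(ego_prior, (0, ego_prior.shape[-1]))     # (B, C, T, H, 2W)
    return torch.cat([x, prior], dim=1)                    # (B, 2C, T, H, 2W)


def geometry_guided_attention(q, k, v, relevance, temperature=1.0):
    """Self-attention biased toward spatially relevant key tokens.

    q, k, v:   (B, N, D) token sequences
    relevance: (B, N, N) geometric relevance scores (assumed given, e.g.,
               from cross-view projection overlap); higher = more related.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5
    # Add a log-space bias so geometrically related tokens receive more
    # attention mass after the softmax.
    logits = logits + temperature * relevance
    return torch.softmax(logits, dim=-1) @ v


if __name__ == "__main__":
    B, C, T, H, W = 1, 4, 2, 8, 8
    fused = unified_conditioning(torch.randn(B, C, T, H, W),
                                 torch.randn(B, C, T, H, W),
                                 torch.randn(B, C, T, H, W))
    print(fused.shape)  # torch.Size([1, 8, 2, 8, 16])

    q = k = v = torch.randn(1, 16, 32)
    out = geometry_guided_attention(q, k, v, torch.zeros(1, 16, 16))
    print(out.shape)    # torch.Size([1, 16, 32])
```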