Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
February 20, 2026
Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
cs.AI
Abstract
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. To this end, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model as a teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated-reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.