Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Abstract

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, which limits their utility for embodied interaction. We introduce a human-centric video world model conditioned on both tracked head pose and joint-level hand poses. To this end, we evaluate existing conditioning strategies for diffusion transformers and propose an effective mechanism for 3D head and hand control that enables dexterous hand-object interactions. Using this strategy, we train a bidirectional video diffusion model as a teacher and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance, as well as a significantly higher perceived level of control over the performed actions, compared with relevant baselines.
