ChatPaper.ai


Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

February 20, 2026
Authors: Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
cs.AI

Abstract

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.
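To make the conditioning idea concrete, the sketch below shows one common way a diffusion transformer can be conditioned on per-frame head and hand poses: embed the pose vector into tokens and let video tokens cross-attend to them. This is a minimal illustration under assumed dimensions (6-DoF head pose, 2 hands × 21 joints × 3 coordinates, toy token sizes), not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class PoseConditionedBlock(nn.Module):
    """One transformer block whose video tokens cross-attend to pose tokens.
    Illustrative only; dimensions and layout are assumptions, not the paper's."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
            nn.GELU(), nn.Linear(4 * dim, dim),
        )

    def forward(self, x, pose_tokens):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]            # spatio-temporal self-attention
        q = self.norm2(x)
        x = x + self.cross_attn(q, pose_tokens, pose_tokens)[0]  # inject pose control
        return x + self.mlp(x)

# Per-frame control signal: 6-DoF head pose + 2 hands x 21 joints x 3 coords.
head_dim, hand_dim, dim = 6, 2 * 21 * 3, 256
embed = nn.Linear(head_dim + hand_dim, dim)

B, F, N = 1, 8, 64                       # batch, frames, spatial tokens per frame
video_tokens = torch.randn(B, F * N, dim)
pose = torch.randn(B, F, head_dim + hand_dim)
pose_tokens = embed(pose)                # one conditioning token per frame

block = PoseConditionedBlock(dim)
out = block(video_tokens, pose_tokens)
print(out.shape)  # torch.Size([1, 512, 256])
```

Cross-attention is only one of the conditioning strategies the abstract alludes to evaluating; alternatives such as channel-wise concatenation or adaptive layer norm modulation plug into the same block structure.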
February 24, 2026