Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Abstract

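Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model conditioned on both tracked head pose and joint-level hand poses. To this end, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control that enables dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived control over the performed actions compared with relevant baselines.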

