Scaling Up Dynamic Human-Scene Interaction Modeling

Abstract

Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction (HSI) modeling, we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available, encompassing over 15 hours of human interactions across 100 indoor scenes. It captures whole-body human motions and part-level object dynamics in detail, with a focus on the realism of contact. The dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to the appearance and motion of both humans and objects while maintaining interaction fidelity. Using TRUMANS, we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length, taking into account both scene context and intended actions. In experiments, our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g., PROX, Replica, ScanNet, ScanNet++), producing motions that closely mimic original motion-captured sequences, as confirmed by quantitative experiments and human studies.
