Controllable Human-Object Interaction Synthesis

Abstract

Synthesizing semantically aware, long-horizon human-object interactions is critical for simulating realistic human behavior. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that uses a conditional diffusion model to generate object motion and human motion simultaneously, given a language description, the initial object and human states, and sparse object waypoints. The language description conveys style and intent, while the waypoints ground the motion in the scene and can be extracted effectively with high-level planning methods. Naively applying a diffusion model fails to predict object motion that aligns with the input waypoints, and it cannot ensure the realism of interactions that require precise hand-object contact and contact that is properly grounded on the floor. To overcome these problems, we introduce an object geometry loss as additional supervision, which improves the match between the generated object motion and the input object waypoints. In addition, we design guidance terms that enforce contact constraints during the sampling process of the trained diffusion model.
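
To make the last two ideas concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes that the object geometry loss compares points on the object posed by the predicted and ground-truth rigid transforms, and that a guidance term can be written as a gradient step on a differentiable cost applied to the predicted clean sample. The function names (`transform_points`, `object_geometry_loss`, `floor_guidance_step`), the L1 distance, the quadratic floor cost, and the z-up convention are all illustrative assumptions.

```python
import numpy as np


def transform_points(points, rotations, translations):
    """Pose rest-pose object points with per-frame rigid transforms.

    points:       (K, 3) points sampled on the object in its rest pose
    rotations:    (T, 3, 3) per-frame rotation matrices
    translations: (T, 3) per-frame translations
    returns:      (T, K, 3) posed points
    """
    rotated = np.einsum("tij,kj->tki", rotations, points)
    return rotated + translations[:, None, :]


def object_geometry_loss(points, rot_pred, trans_pred, rot_gt, trans_gt):
    """Mean L1 distance between predicted and ground-truth posed points.

    Assumed form: supervising posed points (rather than raw pose
    parameters) penalizes rotation and translation errors jointly.
    """
    pred = transform_points(points, rot_pred, trans_pred)
    gt = transform_points(points, rot_gt, trans_gt)
    return np.abs(pred - gt).mean()


def floor_guidance_step(x0_pred, step_size=0.5, floor_height=0.0):
    """One gradient step on a toy floor-penetration cost.

    x0_pred: (N, 3) predicted clean positions, z axis pointing up (assumption).
    cost = sum(min(z - floor_height, 0)^2), so only points below the
    floor receive a non-zero gradient.
    """
    x = x0_pred.copy()
    penetration = np.minimum(x[:, 2] - floor_height, 0.0)
    grad_z = 2.0 * penetration  # derivative of the cost with respect to z
    x[:, 2] -= step_size * grad_z
    return x


if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    eye = np.tile(np.eye(3), (4, 1, 1))  # 4 frames, identity rotation
    t_gt = np.zeros((4, 3))
    t_pred = t_gt + np.array([1.0, 0.0, 0.0])  # off by 1 along x
    print(object_geometry_loss(pts, eye, t_pred, eye, t_gt))

    feet = np.array([[0.0, 0.0, -0.2], [0.0, 0.0, 0.3]])
    print(floor_guidance_step(feet))
```

In an actual sampler, a step like `floor_guidance_step` would be applied at each denoising iteration, and analogous cost terms would encourage hand-object contact. The step size here is chosen only so that the toy example is easy to follow.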