Controllable Human-Object Interaction Synthesis
December 6, 2023
Authors: Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu
cs.AI
Abstract
Synthesizing semantic-aware, long-horizon human-object interaction is
critical for simulating realistic human behaviors. In this work, we address the
challenging problem of generating synchronized object motion and human motion
guided by language descriptions in 3D scenes. We propose Controllable
Human-Object Interaction Synthesis (CHOIS), an approach that generates object
motion and human motion simultaneously using a conditional diffusion model
given a language description, initial object and human states, and sparse
object waypoints. While language descriptions inform style and intent,
waypoints ground the motion in the scene and can be effectively extracted using
high-level planning methods. Naively applying a diffusion model fails to
predict object motion aligned with the input waypoints and cannot ensure the
realism of interactions that require precise hand-object contact and
appropriate contact grounded by the floor. To overcome these problems, we
introduce an object geometry loss as additional supervision to improve the
matching between generated object motion and input object waypoints. In
addition, we design guidance terms to enforce contact constraints during the
sampling process of the trained diffusion model.
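The object geometry loss described above can be illustrated with a minimal sketch. The idea, under assumed conventions, is to sample points from the object's rest-pose mesh, transform them by the predicted and ground-truth per-frame object poses, and penalize the discrepancy; the function name, tensor shapes, and L1 choice here are illustrative, not the paper's exact formulation.

```python
import torch

def object_geometry_loss(pred_rot, pred_trans, gt_rot, gt_trans, rest_points):
    """Hypothetical object geometry loss (shapes are assumptions).

    rest_points: (N, 3) points sampled from the object's rest-pose mesh.
    pred_rot / gt_rot: (T, 3, 3) per-frame rotation matrices.
    pred_trans / gt_trans: (T, 3) per-frame translations.
    """
    # Apply each frame's rigid transform to the sampled points -> (T, N, 3)
    pred_pts = torch.einsum('tij,nj->tni', pred_rot, rest_points) + pred_trans[:, None]
    gt_pts = torch.einsum('tij,nj->tni', gt_rot, rest_points) + gt_trans[:, None]
    # L1 discrepancy averaged over frames and points
    return torch.mean(torch.abs(pred_pts - gt_pts))
```

Supervising transformed surface points, rather than raw pose parameters, ties the loss directly to where the object geometry ends up, which is what waypoint matching cares about.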
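The guidance terms applied during sampling can be sketched in the style of classifier guidance. In this hypothetical interface, `model(x_t, t, cond)` returns a predicted clean sample and `guidance_fn` scores constraint violation (e.g., hand-to-object distance or floor penetration); the exact costs, model interface, and scale are assumptions, not the paper's implementation.

```python
import torch

def guided_denoise_step(model, x_t, t, cond, guidance_fn, scale=1.0):
    """One illustrative guided denoising step (assumed interface).

    model(x_t, t, cond) -> predicted clean sample x0.
    guidance_fn(x0)     -> scalar cost encoding contact constraints.
    The cost gradient nudges the prediction toward satisfying the
    constraints at every sampling step.
    """
    with torch.enable_grad():
        x = x_t.detach().requires_grad_(True)
        x0 = model(x, t, cond)
        cost = guidance_fn(x0)
        grad = torch.autograd.grad(cost, x)[0]
    # Move the denoised sample along the descent direction of the cost
    return x0 - scale * grad
```

Because the perturbation happens at inference time, constraints such as precise hand-object contact can be enforced without retraining the diffusion model.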