ChatPaper.ai

Controllable Human-Object Interaction Synthesis

December 6, 2023
作者: Jiaman Li, Alexander Clegg, Roozbeh Mottaghi, Jiajun Wu, Xavier Puig, C. Karen Liu
cs.AI

Abstract

Synthesizing semantic-aware, long-horizon, human-object interaction is critical to simulate realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that generates object motion and human motion simultaneously using a conditional diffusion model given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints and cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact grounded by the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the matching between generated object motion and input object waypoints. In addition, we design guidance terms to enforce contact constraints during the sampling process of the trained diffusion model.
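The guidance terms mentioned above can be illustrated with a toy sketch (not the paper's actual formulation): after each reverse-diffusion step, the sample is nudged down the gradient of an analytic cost that penalizes hand-object separation. The denoiser here is a placeholder standing in for a trained model, and all names and values are illustrative assumptions.

```python
import numpy as np

def contact_cost_grad(x):
    # x = [hand_pos, obj_pos]; cost = 0.5 * (hand - obj)^2, so the
    # gradient pulls the hand and the object toward each other.
    hand, obj = x
    return np.array([hand - obj, obj - hand])

def denoise_step(x, t):
    # Placeholder denoiser: decay toward zero plus small noise, standing
    # in for the trained diffusion model's reverse step.
    rng = np.random.default_rng(t)
    return 0.9 * x + 0.01 * rng.standard_normal(x.shape)

def guided_sampling(x, steps=50, guidance_scale=0.5):
    # Guidance is applied at every sampling step of the (mock) trained
    # model, analogous to enforcing contact constraints at inference time.
    for t in range(steps, 0, -1):
        x = denoise_step(x, t)
        x = x - guidance_scale * contact_cost_grad(x)
    return x

x0 = guided_sampling(np.array([2.0, -1.0]))
print(abs(x0[0] - x0[1]))  # hand-object gap shrinks toward contact
```

The key design point, as in classifier-style guidance, is that the constraint is imposed at sampling time, so the trained model itself does not need to be retrained for each scene.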