Controllable Human-Object Interaction Synthesis

Abstract

Synthesizing semantic-aware, long-horizon human-object interactions is critical for simulating realistic human behaviors. In this work, we address the challenging problem of generating synchronized object motion and human motion guided by language descriptions in 3D scenes. We propose Controllable Human-Object Interaction Synthesis (CHOIS), an approach that uses a conditional diffusion model to generate object motion and human motion simultaneously, given a language description, initial object and human states, and sparse object waypoints. While language descriptions inform style and intent, waypoints ground the motion in the scene and can be effectively extracted using high-level planning methods. Naively applying a diffusion model fails to predict object motion aligned with the input waypoints, and it cannot ensure the realism of interactions that require precise hand-object contact and appropriate contact with the floor. To overcome these problems, we introduce an object geometry loss as additional supervision to improve the match between the generated object motion and the input object waypoints. In addition, we design guidance terms that enforce contact constraints during the sampling process of the trained diffusion model.
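The abstract does not give the form of the object geometry loss, so the following is only a minimal sketch of one plausible reading: the predicted and reference rigid object poses are applied to the object's rest-pose vertices, and the resulting vertex positions are compared. The function names `transform_vertices` and `object_geometry_loss`, the rotation-matrix parameterization, and the choice of a mean absolute (L1) distance are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def transform_vertices(rest_vertices, rotations, translations):
    """Apply per-frame rigid transforms to rest-pose object vertices.

    rest_vertices: (K, 3) vertices in the object's rest pose.
    rotations:     (T, 3, 3) one rotation matrix per frame.
    translations:  (T, 3) one translation per frame.
    Returns an array of shape (T, K, 3).
    """
    # For each frame t and vertex k: R[t] @ v[k] + trans[t].
    rotated = np.einsum("tij,kj->tki", rotations, rest_vertices)
    return rotated + translations[:, None, :]


def object_geometry_loss(rest_vertices, pred_rot, pred_trans, ref_rot, ref_trans):
    """Hypothetical geometry loss: mean absolute distance between vertices
    posed by the predicted motion and by the reference motion.
    The L1 form is an assumption, not the paper's stated definition."""
    pred = transform_vertices(rest_vertices, pred_rot, pred_trans)
    ref = transform_vertices(rest_vertices, ref_rot, ref_trans)
    return float(np.mean(np.abs(pred - ref)))


if __name__ == "__main__":
    # Toy usage: a two-vertex object over four frames, shifted 0.1 along x.
    rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
    rot = np.stack([np.eye(3)] * 4)
    ref_trans = np.zeros((4, 3))
    pred_trans = ref_trans.copy()
    pred_trans[:, 0] = 0.1
    print(object_geometry_loss(rest, rot, pred_trans, rot, ref_trans))
```

Comparing posed vertices rather than raw pose parameters puts rotation and translation errors into the same spatial units, which is one reason a loss of this kind could help the generated object motion follow the input waypoints.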