PhyGenHOI: Physics-Aware Generation of 4D Dynamic Human-Object Interactions

Abstract

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and a target object, both represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes in which the human actively engages with the object through actions such as punching or kicking, as specified by an input text prompt. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with explicit physical simulation of the object. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated with the Material Point Method (MPM), using 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) a Windowed Attraction Loss that temporally synchronizes the generated motion so that it intercepts the object; (2) a Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) a Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show that PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baseline methods. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/
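To make the first mechanism concrete, the following is a minimal illustrative sketch of what a windowed attraction term could look like. It is not the formulation used by PhyGenHOI, which the abstract does not specify. The function name `windowed_attraction_loss`, the parameters `contact_frame` and `half_width`, and the choice of a mean squared distance are all assumptions made for illustration: the sketch penalizes the distance between an end effector (for example a fist) and the object only inside a temporal window around an intended contact frame, leaving the motion outside that window unconstrained.

```python
def windowed_attraction_loss(effector_traj, target_traj, contact_frame, half_width):
    """Hypothetical windowed attraction term (illustration only).

    effector_traj, target_traj: sequences of 3D points, one per frame.
    contact_frame: assumed frame index at which contact is intended.
    half_width: assumed number of frames on each side of contact_frame.
    Returns the mean squared effector-to-target distance inside the window.
    """
    if len(effector_traj) != len(target_traj):
        raise ValueError("trajectories must have the same number of frames")

    # Clamp the window to the valid frame range.
    start = max(0, contact_frame - half_width)
    stop = min(len(effector_traj), contact_frame + half_width + 1)
    if start >= stop:
        raise ValueError("window does not overlap the trajectory")

    # Accumulate squared distances only for frames inside the window.
    total = 0.0
    for p, q in zip(effector_traj[start:stop], target_traj[start:stop]):
        total += sum((a - b) ** 2 for a, b in zip(p, q))
    return total / (stop - start)
```

In a differentiable pipeline, a term of this kind would typically be written with tensor operations so that gradients can flow back to the motion generator; the pure-Python version above only illustrates the windowing logic.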