CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
April 21, 2026
Authors: Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma
cs.AI
Abstract
Synthesizing human--object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand--object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, a text prompt, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction-geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes the shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
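To make the spatially supervised routing concrete, below is a minimal PyTorch sketch of what a Human-Aware MoE layer could look like: tokens are softly routed to a few lightweight region experts, and the router is additionally supervised with per-token region labels (e.g., derived from face/hand masks). The class name, the three-region split, and the auxiliary cross-entropy routing loss are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a Human-Aware MoE layer with spatially supervised
# routing. All names and the face/hand/other split are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HumanAwareMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 3, hidden: int = 256):
        super().__init__()
        # Lightweight region-specialized experts (e.g., face, hands, other).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # per-token routing logits

    def forward(self, tokens, region_labels=None):
        # tokens: (B, N, D); region_labels: (B, N) integer region ids or None.
        logits = self.router(tokens)                  # (B, N, E)
        weights = logits.softmax(dim=-1)              # soft routing weights
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, N, D, E)
        out = tokens + (expert_out * weights.unsqueeze(2)).sum(-1)

        # Spatial supervision: push the router to send each token to the
        # expert matching its body-region mask, as the abstract describes.
        route_loss = None
        if region_labels is not None:
            route_loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), region_labels.reshape(-1)
            )
        return out, route_loss
```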
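The dual-stream co-generation idea can be sketched the same way: HOI structure tokens cross-attend to RGB tokens in one direction only, so the HOI loss regularizes the shared RGB path during training while the HOI branch can simply be skipped at inference. The block structure, function names, and MSE losses below are hypothetical placeholders for the paper's diffusion objectives.

```python
# A minimal sketch of dual-stream training: an auxiliary HOI structure
# stream attends to RGB tokens and is dropped at inference. Assumed names.
import torch
import torch.nn as nn

class CoGenBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.rgb_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.hoi_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, hoi=None):
        # Shared backbone path: RGB self-attention (always runs).
        rgb = rgb + self.rgb_attn(rgb, rgb, rgb)[0]
        if hoi is not None:
            # HOI tokens attend to RGB tokens one-way: the RGB stream is
            # never conditioned on HOI, so the branch is removable later.
            hoi = hoi + self.hoi_attn(hoi, rgb, rgb)[0]
        return rgb, hoi

def training_step(block, rgb_tokens, hoi_tokens, rgb_target, hoi_target):
    mse = nn.MSELoss()
    rgb_out, hoi_out = block(rgb_tokens, hoi_tokens)
    # The HOI loss back-propagates through the shared RGB path (its keys and
    # values), regularizing the backbone with interaction-geometry signal.
    return mse(rgb_out, rgb_target) + mse(hoi_out, hoi_target)

@torch.no_grad()
def inference_step(block, rgb_tokens):
    # Zero-overhead inference: the HOI branch is simply not executed.
    rgb_out, _ = block(rgb_tokens, hoi=None)
    return rgb_out
```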