

CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

April 21, 2026
Authors: Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma
cs.AI

Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
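The spatially supervised routing idea can be illustrated with a minimal NumPy sketch: a gating network assigns each token to one of a few region experts, and during training the gate probabilities are additionally supervised with a cross-entropy loss against region labels (e.g., from a body-part mask). The expert set, router shape, and loss form here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Illustrative sketch only: experts 0=hand, 1=face, 2=background (assumed
# region split), a linear router, and top-1 hard routing.
rng = np.random.default_rng(0)
N_TOKENS, DIM, N_EXPERTS = 6, 8, 3

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def route(tokens, w_router, region_labels=None):
    """Route each token to one lightweight region expert.

    When region_labels are given (training), also return a cross-entropy
    routing loss that supervises the gate spatially; at inference the
    labels are absent and only the hard assignment is used.
    """
    probs = softmax(tokens @ w_router)           # (N, E) gate probabilities
    assignment = probs.argmax(axis=-1)           # hard top-1 expert per token
    loss = None
    if region_labels is not None:
        picked = probs[np.arange(len(tokens)), region_labels]
        loss = -np.log(picked + 1e-9).mean()     # spatial routing supervision
    return assignment, loss

tokens = rng.normal(size=(N_TOKENS, DIM))
w_router = rng.normal(size=(DIM, N_EXPERTS))
labels = rng.integers(0, N_EXPERTS, size=N_TOKENS)

assignment, loss = route(tokens, w_router, labels)
```

In a full model each expert would be a small FFN applied to its assigned tokens inside the DiT block; the sketch only shows the supervised gating that gives the experts their region specialization.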