CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
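
To make the idea of spatially supervised routing more concrete, the following is a minimal, illustrative sketch and not the authors' implementation. It assumes each token carries router logits over a small set of experts and a region label derived from a spatial mask; the region names, the top-1 routing rule, and the cross-entropy form of the supervision are all assumptions introduced here for illustration.

```python
import math

# Hypothetical region names; the abstract only says experts are region-specialized.
REGIONS = ["background", "hand", "face"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_tokens(router_logits):
    """Top-1 routing (an assumption): pick the highest-scoring expert per token."""
    return [max(range(len(row)), key=row.__getitem__) for row in router_logits]


def routing_supervision_loss(router_logits, region_labels):
    """Mean cross-entropy between router probabilities and per-token region labels."""
    losses = []
    for row, label in zip(router_logits, region_labels):
        probs = softmax(row)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)


# Toy example: four tokens, router logits over the three hypothetical experts.
router_logits = [
    [2.0, 0.1, 0.1],
    [0.2, 3.0, 0.5],
    [0.1, 0.3, 2.5],
    [1.0, 1.2, 0.1],  # this token is labeled background but leans toward "hand"
]
region_labels = [0, 1, 2, 0]  # indices into REGIONS, e.g. from a segmentation mask

assignments = route_tokens(router_logits)
loss = routing_supervision_loss(router_logits, region_labels)
print([REGIONS[i] for i in assignments], round(loss, 4))
```

In this toy setup the last token is routed to a different expert than its region label indicates, so the supervision term is the signal that would push the router toward the mask-derived assignment during training.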
