CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
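
To make the idea of spatially supervised routing concrete, the following is a minimal sketch and not the CoInteract implementation. It assumes a hypothetical set of three experts (body or background, hand, face), a per-token region label derived from some spatial mask, and a cross-entropy term that pushes the router toward the expert matching each token's region. The expert set, the loss form, and all function names are illustrative assumptions; the abstract does not specify them.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routing_loss(router_logits, region_labels):
    """Mean cross-entropy between router distributions and region labels.

    Assumption: each token carries an integer region label (taken here
    from a hypothetical hand/face mask) naming the expert it should use.
    """
    losses = []
    for logits, label in zip(router_logits, region_labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)

def route(router_logits):
    """Top-1 routing: index of the highest-scoring expert per token."""
    return [max(range(len(l)), key=l.__getitem__) for l in router_logits]

# Hypothetical expert ids: 0 = body/background, 1 = hand, 2 = face.
logits = [[2.0, 0.1, 0.1], [0.2, 3.0, 0.1], [0.0, 0.5, 2.5]]
labels = [0, 1, 2]

assignments = route(logits)          # expert chosen for each token
loss = routing_loss(logits, labels)  # small when routing matches regions
```

In this sketch the loss shrinks as the router assigns more probability to the expert that matches each token's region, which is one plausible reading of "spatially supervised" routing.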
