CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering, still frequently fail in two respects: (i) the structural stability of sensitive regions such as hands and faces, and (ii) physically plausible contact, for example avoiding hand-object interpenetration. We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, a text prompt, and speech audio. CoInteract embeds two complementary designs into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream in order to inject interaction-geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes the shared backbone weights; at inference, the HOI branch is removed, so RGB generation incurs zero additional overhead. Experimental results show that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
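The train-time/inference-time asymmetry described in the abstract can be illustrated with a toy attention mask. The following is a minimal sketch, not the authors' implementation: it assumes (the abstract does not state this) that RGB tokens are prevented from attending to HOI tokens, which is one way the HOI branch could be dropped at inference without changing the RGB output. The names `attention` and `co_generation_mask`, the single-head unprojected attention, and the toy token values are all hypothetical.

```python
import math

def attention(queries, keys, values, mask):
    """Single-head scaled dot-product attention in plain Python.
    mask[i][j] is True when query i is allowed to see key j."""
    dim = len(queries[0])
    out = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        peak = max(scores)                          # for numerical stability
        weights = [math.exp(s - peak) for s in scores]  # masked keys get weight 0.0
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))
        ])
    return out

def co_generation_mask(n_rgb, n_hoi):
    """Assumed one-way mask: RGB queries see only RGB keys,
    HOI queries see both RGB and HOI keys."""
    n = n_rgb + n_hoi
    return [[(j < n_rgb) or (i >= n_rgb) for j in range(n)] for i in range(n)]

# Toy tokens (hypothetical values): 3 RGB tokens followed by 2 HOI tokens.
rgb = [[0.1, 0.4], [0.7, -0.2], [-0.3, 0.5]]
hoi = [[0.9, 0.9], [-0.6, 0.2]]
tokens = rgb + hoi

# "Training": both streams pass through the same attention layer.
train_mask = co_generation_mask(len(rgb), len(hoi))
train_out = attention(tokens, tokens, tokens, train_mask)

# "Inference": the HOI tokens are simply dropped.
infer_out = attention(rgb, rgb, rgb, co_generation_mask(len(rgb), 0))

# Under the assumed mask, the RGB outputs match with and without the HOI stream.
rgb_unchanged = all(
    math.isclose(a, b, abs_tol=1e-12)
    for row_a, row_b in zip(train_out[:len(rgb)], infer_out)
    for a, b in zip(row_a, row_b)
)
print(rgb_unchanged)
```

In this sketch the HOI stream can only influence RGB generation indirectly, through gradients on weights that both streams share. That matches the abstract's statement that HOI supervision regularizes the shared backbone; in a real DiT those shared weights would be the attention and feed-forward projections, which this toy omits.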
