CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation

Abstract

Synthesizing human-object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail in two respects: (i) the structural stability of sensitive regions such as hands and faces, and (ii) physically plausible contact (e.g., avoiding hand-object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, a text prompt, and speech audio. CoInteract introduces two complementary designs embedded in a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes the shared backbone weights; at inference, the HOI branch is removed, so RGB generation incurs no additional overhead. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
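The spatially supervised routing idea can be illustrated with a toy sketch. This is not the authors' implementation: the abstract only states that routing is spatially supervised, so the three region labels, the linear router, the top-1 selection, and the cross-entropy routing loss below are all assumptions made for illustration, and every function name is hypothetical.

```python
import math

# Hypothetical region ids. The abstract names hands and faces as sensitive
# regions; the "background" class and this ordering are assumptions.
REGIONS = ("background", "hand", "face")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def router_logits(token, weights):
    """Linear router: one logit per expert (dot product with each weight row)."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]


def routing_loss(tokens, region_ids, weights):
    """Mean cross-entropy between router probabilities and region labels.

    Assumption: region_ids come from a spatial mask (for example a hand/face
    segmentation) that has been aligned with the token grid.
    """
    total = 0.0
    for token, region in zip(tokens, region_ids):
        probs = softmax(router_logits(token, weights))
        total -= math.log(probs[region])
    return total / len(tokens)


def route(tokens, weights):
    """Top-1 routing: index of the expert with the largest router logit."""
    routed = []
    for token in tokens:
        logits = router_logits(token, weights)
        routed.append(max(range(len(logits)), key=logits.__getitem__))
    return routed
```

In this sketch, minimizing `routing_loss` pushes the router to send each token to the expert whose index matches the token's region label, which is one plausible reading of "spatially supervised routing". A router whose weight rows align with the region features yields a lower loss than an uninformative all-zero router, whose loss equals the log of the number of experts.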

