ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Abstract

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained, independent editing of subjects and scenes remains a critical challenge. Attempts to add richer environment control through rigid 3D geometric compositions often face a stark trade-off between precise control and generative flexibility, and their heavy 3D pre-processing limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignment. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. The project page is available at https://martayang.github.io/ONE-SHOT/.
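To make the idea of injecting a motion signal via cross-attention concrete, the following is a minimal sketch of generic cross-attention, not the ONE-SHOT implementation. It assumes that video tokens act as queries and motion tokens act as keys and values, omits learned projections, and adds the attended result back residually. The function name `cross_attention_inject` and the `scale` parameter are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_inject(video_tokens, motion_tokens, scale=1.0):
    """Generic cross-attention with a residual connection.

    Assumption (not from the paper): each video token is a query, each
    motion token serves as both key and value, and the Q/K/V projections
    are the identity to keep the sketch short.
    """
    d = len(video_tokens[0])
    out = []
    for q in video_tokens:
        # Scaled dot-product scores of this query against every motion token.
        scores = [dot(q, k) / math.sqrt(d) for k in motion_tokens]
        weights = softmax(scores)
        # Weighted sum of the motion tokens (the values).
        attended = [
            sum(w * v[i] for w, v in zip(weights, motion_tokens))
            for i in range(d)
        ]
        # Residual injection: the video token keeps its own content.
        out.append([x + scale * a for x, a in zip(q, attended)])
    return out


video = [[1.0, 0.0], [0.0, 1.0]]
motion = [[0.5, 0.5]]
injected = cross_attention_inject(video, motion)
```

Because the motion signal enters only through the attended term, setting `scale` to zero returns the video tokens unchanged, which is one simple way to see that the two signals stay separable in this sketch.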

