ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Abstract

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained, independent editing of subjects and scenes remains a critical challenge. Recent attempts to add richer environment control through rigid 3D geometric compositions often face a stark trade-off between precise control and generative flexibility, and their heavy 3D pre-processing limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes correspondences between disparate spatial domains without heuristic 3D alignment. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. Our project page is available at: https://martayang.github.io/ONE-SHOT/.
