ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Abstract

Recent advances in Video Foundation Models (VFMs) have revolutionized human-centric video synthesis, yet fine-grained, independent editing of subjects and scenes remains a critical challenge. Recent attempts to add richer environment control through rigid 3D geometric compositions often face a stark trade-off between precise control and generative flexibility. Furthermore, the heavy 3D pre-processing they require still limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignment. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. The project page is available at: https://martayang.github.io/ONE-SHOT/.
