ONE-SHOT: Compositional Human-Environment Video Synthesis via Spatial-Decoupled Motion Injection and Hybrid Context Integration

Abstract

Advances in Video Foundation Models (VFMs) have transformed human-centric video synthesis, yet fine-grained, independent editing of subjects and scenes remains a critical challenge. Recent attempts to add richer environment control through rigid 3D geometric compositions often face a stark trade-off between precise control and generative flexibility, and their heavy 3D pre-processing limits practical scalability. In this paper, we propose ONE-SHOT, a parameter-efficient framework for compositional human-environment video generation. Our key insight is to factorize the generative process into disentangled signals. Specifically, we introduce a canonical-space injection mechanism that decouples human dynamics from environmental cues via cross-attention. We also propose Dynamic-Grounded-RoPE, a novel positional embedding strategy that establishes spatial correspondences between disparate spatial domains without any heuristic 3D alignment. To support long-horizon synthesis, we introduce a Hybrid Context Integration mechanism that maintains subject and scene consistency across minute-level generations. Experiments demonstrate that our method significantly outperforms state-of-the-art methods, offering superior structural control and creative diversity for video synthesis. The project page is available at https://martayang.github.io/ONE-SHOT/.
