DomainShuttle: Freeform Open-Domain Subject-Driven Text-to-Video Generation

Abstract

Open-domain subject-driven text-to-video (S2V) generation has drawn significant interest in academia and industry. Open-domain S2V mainly involves two scenarios: in-domain, which requires retaining the reference subject's features as faithfully as possible, and cross-domain, which preserves the intrinsic features of the subject while allowing subject-irrelevant properties to vary flexibly according to the text prompt. Existing methods primarily focus on maximizing subject fidelity in in-domain scenarios, which limits their editability and adaptability in cross-domain scenarios such as novel styles, semantic combinations, or domain attributes. In this study, we argue that an ideal S2V method should flexibly shuttle between different domains, achieving strong performance in both in-domain and cross-domain scenarios. To this end, we propose DomainShuttle, which achieves high fidelity and generative flexibility for open-domain video personalization. Specifically, we introduce Domain-MoT, which decouples video and reference features and adds a domain-aware AdaLN for domain-specific modeling of reference images. We then introduce the Video-Reference DualRoPE scheme, which places reference image tokens and video tokens in separate RoPE spaces to enable precise subject-level spatial modeling, and the Cross-Pair Consistent Loss, which aims to extract intrinsic subject features unaffected by irrelevant features. Extensive experiments demonstrate that DomainShuttle achieves significant performance improvements over existing methods, exhibiting high subject fidelity and generative flexibility across diverse open-domain application scenarios.
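
The abstract does not specify how the Video-Reference DualRoPE scheme separates the two token groups. As a purely illustrative sketch, one possible reading is that video tokens and reference image tokens receive position indices from disjoint ranges before a standard rotary embedding is applied. The helper names `rope_rotate` and `assign_positions`, the `ref_offset` parameter, and the 1D setting are all assumptions made here for illustration and are not taken from the paper.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard 1D rotary position embedding to an even-length vector.

    Consecutive pairs (vec[i], vec[i+1]) are rotated by an angle that depends
    on the token position and on the pair's frequency.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even feature dimension")
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # frequency falls with pair index
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def assign_positions(num_video_tokens, num_ref_tokens, ref_offset):
    """Hypothetical position assignment: video tokens use 0..N-1, while
    reference tokens start at ref_offset so the two ranges never overlap
    (assumes ref_offset >= num_video_tokens)."""
    video_pos = list(range(num_video_tokens))
    ref_pos = [ref_offset + j for j in range(num_ref_tokens)]
    return video_pos, ref_pos


# Example: 4 video tokens and 2 reference tokens in disjoint position ranges.
video_pos, ref_pos = assign_positions(4, 2, ref_offset=1000)
query = rope_rotate([1.0, 0.0, 0.5, -0.5], video_pos[3])
key = rope_rotate([0.3, 0.7, -0.2, 0.1], ref_pos[0])
```

Because rotary embeddings make attention scores depend only on relative position, keeping the reference positions in their own range gives reference tokens a relative offset to the video tokens that no video-to-video pair shares. Whether DomainShuttle uses an offset, a separate axis, or some other construction is not stated in the abstract.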